ZFS for Databases
1. Delphix Agile Data Platform
ZFS for Databases
Adam Leventhal
CTO, Delphix
@ahl
2. Definition 1: ZFS Storage Appliance (ZSA)
• Shipped by Sun in 2008
• Originally the Sun Storage 7000
3. Definition 2: Filesystem for Solaris
• Filesystem developed in the Solaris Kernel Group
• First shipped in 2006 as part of Solaris 10 u2
• The engine for the ZSA
• Always consistent on disk (no fsck)
• End-to-end (strong) checksumming
• Snapshots are cheap to create; no practical limit
• Built-in replication
• Custom RAID (RAID-Z)
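Those last few bullets are one-liners in practice. A minimal sketch, assuming a pool named tank with a dataset tank/db (both names hypothetical):

```shell
# Point-in-time snapshot: copy-on-write, effectively instant,
# consumes no space up front.
zfs snapshot tank/db@pre-upgrade

# Writable clone of that snapshot; shares all unmodified blocks
# with its origin, so it is nearly free to create.
zfs clone tank/db@pre-upgrade tank/db-test

# Built-in replication: serialize the snapshot and receive it on
# another host (standby and backup/db are hypothetical names).
zfs send tank/db@pre-upgrade | ssh standby zfs receive backup/db
```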
4. Definition 3: OpenZFS
• Sun open sourced ZFS in 2006
• Oracle closed it in 2010
• OpenZFS has continued
• Many of the same developers
– Many left Oracle for companies innovating around OpenZFS
• Expanded beyond Solaris
– Active OpenZFS ports on Linux, FreeBSD, Mac OS X
• Significant evolution
– Many critical bugs fixed
– Test framework, CLI improvements, progress reporting and resumability for replication, lz4, simpler API, etc.
– Big emphasis on data driven performance enhancements
5. This Talk
• First, which ZFS? The filesystem one.
– Most will apply to both Oracle Solaris ZFS and OpenZFS
• Benefits of ZFS
• Practical considerations: storage pool and dataset layout
• One highly relevant area of performance analysis
6. Who am I?
• Joined the Solaris Kernel Group in 2001
• One of the three developers of DTrace
• Added double- and triple-parity RAID-Z to ZFS
• Founding member of the ZSA team (Fishworks) in 2006
• Joined Delphix in 2010
– Founded in 2008 using ZFS as a component
– Virtualize the database
– Database copies become as cheap and flexible as VMs
– Agile data for faster projects, more efficient devs, and happier DBAs
– Now the leader in ZFS expertise
– Founded the OpenZFS project
– Also: UKOUG TECH13 sponsor; check out our booth; drinks
7. Why ZFS for Databases?
• Modern – in development for over 12 years
• Stable – in production for over 7 years
• Strong data integrity
• No practical limit on snapshots or clones
• Not all good news:
– Random writes turn into sequential writes
– Sequential reads turn into random reads
– (Like NetApp/WAFL)
8. RAID-Z
• Traditional RAID-5/6/7 requires NVRAM to perform well (and to avoid the write hole)
• RAID-Z always writes full, variable-width stripes
• Particularly good for cheap disks
“Oracle Solaris ZFS implements an improvement on RAID-5, RAID-Z3, which uses parity, striping, and atomic operations to ensure reconstruction of corrupted data even in the face of three concurrent drive failures. It is ideally suited for managing industry standard storage servers.”*
• Not strictly better
– Individual records are split between disks
– RAID-5/6/7 -- a random read translates to a single disk read
– RAID-Z – a random read becomes many disk ops (like RAID-3)
*www.oracle.com/us/products/servers-storage/solaris/solaris-zfs-ds-067320.pdf
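The random-read penalty is easy to put in rough numbers. A back-of-envelope sketch, with hypothetical figures (one 8-disk group, ~150 random reads per second per spindle):

```shell
disks=8
iops_per_disk=150

# RAID-5/6: a small random read lands on a single disk, so the
# group's random-read IOPS scales with the number of spindles.
echo "RAID-5 random-read IOPS: $(( disks * iops_per_disk ))"

# RAID-Z: each record is split across the stripe, so every random
# read touches (nearly) every disk; the vdev behaves like one spindle.
echo "RAID-Z random-read IOPS: $(( iops_per_disk ))"
```

The point is the shape, not the exact numbers: for small random reads a RAID-Z vdev delivers roughly the IOPS of a single disk, which is why wide RAID-Z stripes and random-read-heavy databases mix poorly.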
9. Datasets for Oracle
• Filesystems (datasets) cheap/easy to create in ZFS
• Key settings
– recordsize – atomic unit in ZFS; match Oracle block size (8K)
– logbias={latency,throughput} – QoS hint
– primarycache={none,metadata,all} – caching hint
# zfs create -o recordsize=8k -o logbias=throughput pool/datafiles
# zfs create -o recordsize=8k -o logbias=throughput pool/temp
# zfs create -o primarycache=metadata pool/archive
# zfs create pool/redo
# zfs list -o name,recordsize,logbias,primarycache
NAME            RECSIZE  LOGBIAS     PRIMARYCACHE
...
pool/archive       128K  latency     metadata
pool/datafiles       8K  throughput  all
pool/redo          128K  latency     all
pool/temp            8K  throughput  all
11. Oracle Solaris ZFS Write Throttle
• Basic problem: limit rate of input to rate of output
• Originally no write throttle: consume all memory, then wait
• ZFS composes transactions into transaction groups
• Idea: limit the size of a transaction group
• Figure out the backend throughput; target a few seconds
12. ZFS Write Throttle Problems
• Transaction group full? Start writing it out
• One already being written out? Wait
• And it can be a looooong wait
• Solution?
– When the transaction group is 7/8ths full, delay for 10ms
– Didn’t guess that, did you?
15. Oracle Solaris ZFS Tuning
• IO queue depth: zfs_vdev_max_pending
– Default of 10; may be reasonable for spinning disks
– ZFS on a SAN? 24-100
– Higher for additional throughput
– Lower for reduced latency
• Transaction group duration: zfs_txg_synctime
– Default of 5 seconds
– Higher for more metadata amortization
– Lower for a smaller window for data loss with non-synced writes
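On Solaris these are live kernel variables. A sketch of one common way to inspect and change them with mdb(1); the values are illustrative, not recommendations:

```shell
# Read the current queue depth (decimal).
echo "zfs_vdev_max_pending/D" | mdb -k

# Raise it for a SAN-backed pool (0t = decimal); takes effect immediately
# but does not survive a reboot.
echo "zfs_vdev_max_pending/W0t32" | mdb -kw

# Persistent equivalent in /etc/system:
#   set zfs:zfs_vdev_max_pending = 32
#   set zfs:zfs_txg_synctime = 5
```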
16. Back to the ZFS Write Throttle
• The measured IO throughput swings wildly:
# dtrace -n 'BEGIN{ start = timestamp; } fbt::dsl_pool_sync:entry/stringof(args[0]->dp_spa->spa_name) ==
"domain0"/{ @[(timestamp - start) / 1000000000] = min(args[0]->dp_write_limit / 1000000); }' -xaggsortkey
dtrace: description 'BEGIN' matched 2 probes
…
       14              487
       15              515
       16              515
       17              557
       18              581
       19              581
       20              617
       21              617
       22              635
       23              663
       24              663
…
• Many factors impact the measured IO throughput
• The wrong guess can lead to massive delays
17. OpenZFS I/O Scheduler
• Throw out the ZFS write throttle and IO queue
• Queue depth and throttle based on quantity of modified data
[Chart: per-operation delay and queue depth (0-20) versus percent of dirty data (0-100%)]
• Result: smooth, single-moded write latency
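The shape of that curve can be sketched numerically. Assuming the delay follows the published OpenZFS design, roughly scale * (dirty - min) / (max - dirty) once dirty data passes the ~60% threshold, the per-operation delay rises smoothly instead of jumping:

```shell
# Sketch of the OpenZFS write-throttle delay curve, in ns per operation.
# scale mirrors the zfs_delay_scale default; 60 is the default
# dirty-percent threshold below which no delay is applied.
awk 'BEGIN {
    scale = 500000
    for (pct = 60; pct <= 95; pct += 5)
        printf "%2d%% dirty -> delay %7d ns\n", pct,
               scale * (pct - 60) / (100 - pct)
}'
```

Small amounts of dirty data incur no delay at all; as dirty data approaches the maximum, the delay grows without bound, which is exactly what produces the smooth, single-moded latency above.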
18. OpenZFS I/O Scheduler Tuning
• Tunables that are easier to reason about
– zfs_vdev_async_write_max_active (default: 10)
– zfs_dirty_data_max (default: min(memory/10, 4GB))
– zfs_delay_max_ns (default: 100µs)
– zfs_delay_scale (delay curve; default: 500µs/op)
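On the OpenZFS ports these are module parameters rather than mdb-patched kernel variables. A sketch for ZFS on Linux, using the standard module-parameter paths (values illustrative, not recommendations):

```shell
# Inspect a tunable at runtime.
cat /sys/module/zfs/parameters/zfs_dirty_data_max

# Change it live (bytes; here 4 GiB).
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max

# Persist across reboots via /etc/modprobe.d/zfs.conf:
#   options zfs zfs_dirty_data_max=4294967296 zfs_delay_scale=500000
```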
19. Summing Up
• ZFS is great for databases
– Storage Appliance, Oracle Solaris, OpenZFS
• Important best practices
• Beware the false RAID-Z idol
• Measure, measure, measure
– DTrace is your friend (Wednesday 11:00am Exchange 1)