These slides were presented at Mydbops Database Meetup 4 by Bajranj (Zenefits). ZFS is a filesystem with features that can benefit MySQL, such as compression and quick snapshots.
2. ZFS Principles
● Pooled storage
  ○ Completely eliminates the antique notion of volumes
  ○ Does for storage what VM did for memory
● Transactional object system
  ○ Always consistent on disk – no fsck, ever
● Provable end-to-end data integrity
  ○ Detects and corrects silent data corruption
● Simple administration
  ○ Concisely express your intent
3. FS/Volume Model vs Pooled Storage
Traditional Volumes
● Abstraction: virtual disk
● Partition/volume for each FS
● Grow/shrink by hand
● Each FS has limited bandwidth
● Storage is fragmented, stranded
ZFS Pooled Storage
● Abstraction: malloc/free
● No partitions to manage
● Grow/shrink automatically
● All bandwidth always available
● All storage in the pool is shared
[Diagram: traditional model with one volume under each FS, versus three ZFS filesystems sharing a single storage pool]
5. Benefits of ZFS
● Copy-on-Write (CoW) File System.
● Throttles writes.
● Data integrity and resiliency.
● Self Healing of Data on ZFS.
● Block size matching (allows variable block sizes).
● Snapshots & Clones
● Active development community
7. Block Pointer Structure in ZFS
[Diagram: fields of a ZFS block pointer]
● vdev1, ASIZE, offset1: first copy of data
● vdev2, ASIZE, offset2: second copy of data (for metadata)
● vdev3, ASIZE, offset3: third copy of data (pool-wide metadata)
● BDX, lvl, type, cksum, comp, PSIZE, LSIZE
● padding
● physical birth txg, logical birth txg: when the block was written
● fill count
● 256-bit checksum: checksum of the data this block points to
8. End-to-End Data Integrity in ZFS
ZFS validates the entire I/O path
✓ Bit rot
✓ Phantom writes
✓ Misdirected reads and writes
✓ DMA parity errors
✓ Driver bugs
✓ Accidental overwrite
Disk checksum only validates media
✓ Bit rot
✗ Phantom writes
✗ Misdirected reads and writes
✗ DMA parity errors
✗ Driver bugs
✗ Accidental overwrite
Disk Block Checksums
● Checksum stored with data block
● Any self-consistent block will pass
● Can't detect stray writes
● Inherent FS/volume interface limitation
[Diagram: each data block stored together with its own checksum]
ZFS Data Authentication
● Checksum stored in parent block pointer
● Fault isolation between data and checksum
● Checksum hierarchy forms a self-validating Merkle tree
[Diagram: each parent block holds the address and checksum of its children, down to the data blocks]
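The difference between the two checksum placements can be sketched with a toy model. This is only an illustration, not ZFS code: the `disk` dictionary, the `parent_pointer` record, and the use of SHA-256 as the 256-bit checksum are all assumptions made for the example. It simulates a misdirected write and shows that a checksum stored next to the data still passes, while a checksum held in the parent pointer does not.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in for a 256-bit block checksum (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


# Toy "disk": address -> (data, checksum stored alongside the data).
disk = {
    0: (b"row A", checksum(b"row A")),
    1: (b"row B", checksum(b"row B")),
}

# Hypothetical parent block pointer: the child's address plus the checksum
# of the data the parent expects to find there.
parent_pointer = {"address": 0, "checksum": checksum(b"row A")}


def self_check(address: int) -> bool:
    """Disk-style check: compare the data with the checksum stored next to it."""
    data, stored = disk[address]
    return checksum(data) == stored


def parent_check(pointer: dict) -> bool:
    """ZFS-style check: compare the data with the checksum held by the parent."""
    data, _ = disk[pointer["address"]]
    return checksum(data) == pointer["checksum"]


# Simulate a misdirected write: the self-consistent block from address 1
# lands on address 0.
disk[0] = disk[1]

self_ok = self_check(0)                   # the stray block is self-consistent
parent_ok = parent_check(parent_pointer)  # the parent expected different data
print(self_ok, parent_ok)
```

Because the stray block carries its own matching checksum, only the parent-held checksum notices that the wrong data is at that address.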
9. Self Healing of Data in ZFS
[Diagram: an application reading from a ZFS mirror, shown in three steps]
1. Application issues a read. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the next disk. Checksum indicates that the block is good.
3. ZFS returns good data to the application and repairs the damaged block.
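The three steps above can be sketched as a small simulation. This is a minimal model under stated assumptions, not how ZFS is implemented: the mirror is a plain Python list of byte strings, `mirror_read` is a hypothetical helper, and SHA-256 stands in for the block checksum.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in for the block checksum (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def mirror_read(mirror: list, expected: str) -> bytes:
    """Return the first copy matching the expected checksum and repair the rest."""
    good = None
    for copy in mirror:
        if checksum(copy) == expected:
            good = copy
            break
    if good is None:
        raise IOError("no valid copy found on any side of the mirror")
    # Overwrite every damaged copy with the verified data.
    for i, copy in enumerate(mirror):
        if checksum(copy) != expected:
            mirror[i] = good
    return good


block = b"innodb page"
expected = checksum(block)         # in ZFS this lives in the parent block pointer
mirror = [b"innodb pagX", block]   # the first disk holds a silently corrupted copy

result = mirror_read(mirror, expected)
print(result, mirror)
```

After the read, both sides of the toy mirror hold the verified data, which mirrors the "return good data and repair" behaviour described on the slide.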
10. Initial Use case at Zenefits
We use AWS snapshots to rebuild a new DB for dev/ops. The first access to the data is slow because “New volumes created from existing EBS snapshots load lazily in the background”.
Data from multiple DB clusters is needed to generate the DB for dev/ops, so we use multi-source replication.
11. Alternatives
Attach multiple EBS volumes to a MySQL slave and rotate them on each fresh snapshot request.
Con: Requires additional EBS volumes and still has the problem of slow initial queries (snapshots are taken every 15 minutes).
Use Percona XtraBackup to copy data incrementally to the Spoof Instance.
Con: Requires an additional EBS volume, and the MySQL service must be shut down for the entire time the backup is being restored.
Use the ZFS file system to take snapshots at the file system level.
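As a rough sketch of what the ZFS option could look like in practice, the helper below builds `zfs snapshot` and `zfs clone` command lines. The dataset names `zp1/mysql` and `zp1/mysql/data` come from the setup slides later in this deck; the snapshot name `refresh1`, the clone target `zp1/devdb`, and the helper functions themselves are hypothetical and only for illustration. The `run` helper defaults to a dry run, so nothing is executed unless explicitly requested.

```python
import subprocess


def snapshot_cmd(dataset: str, snap: str, recursive: bool = True) -> list:
    """Build a `zfs snapshot` command; -r also snapshots child datasets."""
    cmd = ["zfs", "snapshot"]
    if recursive:
        cmd.append("-r")
    cmd.append(f"{dataset}@{snap}")
    return cmd


def clone_cmd(dataset: str, snap: str, target: str) -> list:
    """Build a `zfs clone` command that creates a writable copy of a snapshot."""
    return ["zfs", "clone", f"{dataset}@{snap}", target]


def run(cmd: list, dry_run: bool = True) -> str:
    """Return the command line; execute it only when dry_run is False."""
    if not dry_run:
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


# "refresh1" and "zp1/devdb" are made-up names for this example.
print(run(snapshot_cmd("zp1/mysql", "refresh1")))
print(run(clone_cmd("zp1/mysql/data", "refresh1", "zp1/devdb")))
```

A clone shares blocks with its snapshot, so no full copy of the data is needed before a dev/ops MySQL instance can be pointed at it.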
12. Setting up ZFS on MySQL
● Create a pool named “zp1”
zpool create -f -o ashift=12 -o autoexpand=on -O compression=gzip "zp1" mirror "/dev/xvdm" "/dev/xvdn"
● Create a new filesystem named “mysql” in pool “zp1”
# Create the ZFS filesystems (Ansible task)
- name: Create a new file system called mysql in pool zp1
  zfs:
    name: zp1/mysql
    state: present
    extra_zfs_properties:
      # 'off' is quoted so that YAML does not read it as a boolean
      setuid: 'off'
      compression: gzip
      recordsize: 128k
      atime: 'off'
      primarycache: metadata
13. Setting up ZFS on MySQL
● Create the required datasets to run MySQL
NAME             USED   AVAIL  REFER  MOUNTPOINT
zp1/mysql        1.19T  4.92T  100K   /zp1/mysql
zp1/mysql/data   1.18T  4.92T  1.17T  /data2/data
zp1/mysql/logs   9.97G  4.92T  8.84G  /data2/logs
zp1/mysql/tmp    216K   4.92T  152K   /data2/tmp
● Configurations on MySQL
innodb_doublewrite = 0
innodb_checksum_algorithm = none
innodb_use_native_aio = 0
14. ZPOOL Status
● ZPOOL status
zpool status
  pool: zp1
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        zp1         ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            xvdm    ONLINE       0     0     0
            xvdn    ONLINE       0     0     0
errors: No known data errors
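For monitoring, output like the above can be scanned for devices that are not ONLINE or that show non-zero error counters. The following is a minimal sketch that assumes the column layout shown on this slide (NAME STATE READ WRITE CKSUM); the `unhealthy_devices` helper is hypothetical, and real `zpool status` output can contain additional sections that this parser does not handle.

```python
STATUS = """\
  pool: zp1
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        zp1         ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            xvdm    ONLINE       0     0     0
            xvdn    ONLINE       0     0     0
errors: No known data errors
"""


def unhealthy_devices(status_text: str) -> list:
    """Return names of config rows that are not ONLINE or have error counts."""
    bad = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            in_config = True       # device rows follow the column header
            continue
        if fields[0].endswith(":"):
            in_config = False      # a "key:" line such as "errors:" ends the table
            continue
        if in_config and len(fields) >= 5:
            name, state, read, write, cksum = fields[:5]
            if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
                bad.append(name)
    return bad


print(unhealthy_devices(STATUS))
```

A cron job or monitoring agent could feed the live command output into such a function and alert when the returned list is not empty.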
21. ZFS - Challenges
● Fragmentation.
● Complex to tweak and tune.
● Requires extra free space or pool performance can suffer.
22. Further ...
● High read throughput (>= 83.88 million)
● MySQL / sec up to 76.2 K
● InnoDB file I/O writes up to 150 K
● Enterprise-grade transactional file system.
● Automatically reconstructs data after detecting an error.
● Combines multiple physical media devices into one logical volume using ZPOOL.
● Snapshot and mirroring capabilities, and can quickly compress data (LZ4).
Enjoy a user-friendly, high-volume storage system.