Data-Driven Development in OpenZFS
Adam Leventhal, CTO Delphix
@ahl
ZFS Was Slow, Is Faster

Adam Leventhal, CTO Delphix
@ahl
My Version of ZFS History
• 2001-2005 The 1st age of ZFS: building the behemoth
– Stability, reliability, features

• 2006-2008 The 2nd age of ZFS: appliance model and open source
– Completing the picture; making it work as advertised; still more features

• 2008-2010 The 3rd age of ZFS: trial by fire
– Stability in the face of real workloads
– Performance in the face of real workloads
The 1st Age of OpenZFS
• All the stuff Matt talked about, yes:
– Many platforms
– Many companies
– Many contributors

• Performance analysis on real and varied customer workloads
A note about the data
• The data you are about to see is real
• The names have been changed to protect the innocent (and guilty)
• It was mostly collected with DTrace
• We used some other tools as well: lockstat, mpstat
• You might wish I had more / different data – I do too
Writes Are Slow
NFS Sync Writes
sync write

   microseconds
           value  ------------- Distribution ------------- count
               8 |                                         0
              16 |                                         149
              32 |@@@@@@@@@@@@@@@@@@@@@                    8682
              64 |@@@@@                                    2226
             128 |@@@@                                     1743
             256 |@@                                       658
             512 |                                         95
            1024 |                                         20
            2048 |                                         19
            4096 |                                         122
            8192 |@@                                       744
           16384 |@@                                       865
           32768 |@@                                       625
           65536 |@                                        316
          131072 |                                         113
          262144 |                                         22
          524288 |                                         70
         1048576 |                                         94
         2097152 |                                         16
         4194304 |                                         0
IO Writes
write

   microseconds
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         338
              64 |                                         490
             128 |                                         720
             256 |@@@@                                     15079
             512 |@@@@@                                    20342
            1024 |@@@@@@@                                  27807
            2048 |@@@@@@@@                                 28897
            4096 |@@@@@@@@                                 29910
            8192 |@@@@@                                    20605
           16384 |@                                        5081
           32768 |                                         1079
           65536 |                                         69
          131072 |                                         5
          262144 |                                         1
          524288 |                                         0
NFS Sync Writes: Even Worse
sync write

   microseconds
           value  ------------- Distribution ------------- count
               8 |                                         0
              16 |@                                        9
              32 |@@@@@@@@@@                               84
              64 |@@@@@@@@@@                               85
             128 |@@@@                                     34
             256 |@                                        9
             512 |                                         0
            1024 |                                         1
            2048 |                                         2
            4096 |@                                        7
            8192 |@@                                       19
           16384 |@                                        7
           32768 |                                         2
           65536 |                                         2
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |@@                                       14
         2097152 |@@@@@@                                   51
         4194304 |@                                        7
         8388608 |                                         0
First Problem: The Write Throttle
How long is spa_sync() taking?
#!/usr/sbin/dtrace -s
fbt::spa_sync:entry
/stringof(args[0]->spa_name) == "domain0"/
{
self->ts = timestamp;
loads = 0;
}
fbt::space_map_load:entry
/stringof(args[4]->os_spa->spa_name) == "domain0"/
{
loads++;
}
fbt::spa_sync:return
{
@["microseconds", loads] = quantize((timestamp - self->ts) / 1000);
self->ts = 0;
}
How long is spa_sync() taking?
# ./sync.d -c 'sleep 60'
dtrace: script './sync.d' matched 3 probes
dtrace: pid 20420 has exited

  microseconds                                            15
           value  ------------- Distribution ------------- count
          524288 |                                         0
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
         2097152 |                                         0

  microseconds                                            16
           value  ------------- Distribution ------------- count
          524288 |                                         0
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           20
         2097152 |@@@@@@@@@@                               7
         4194304 |                                         0
Where is spa_sync() giving up the CPU?
#!/usr/sbin/dtrace -s
fbt::spa_sync:entry{ self->ts = timestamp; }

sched:::off-cpu/self->ts/{ self->off = timestamp; }
sched:::on-cpu
/self->off/
{
@s[stack()] = quantize((timestamp - self->off) / 1000);
self->off = 0;
}
fbt::spa_sync:return
/self->ts/
{
@t["microseconds", probefunc] = quantize((timestamp - self->ts) / 1000);
self->ts = 0;
self->sync = 0;
}
Where is spa_sync() giving up the CPU?
…
genunix`cv_wait+0x61
zfs`zio_wait+0x5d
zfs`dsl_pool_sync+0xe1
zfs`spa_sync+0x38d
zfs`txg_sync_thread+0x247
unix`thread_start+0x8

           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@                                   4
            1024 |@@@@@@@@@@@@                             8
            2048 |                                         0
            4096 |                                         0
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |                                         0
          262144 |                                         0
          524288 |@@@@                                     3
         1048576 |@@@                                      2
         2097152 |@@@@@@@@@@@@@                            9
         4194304 |@                                        1
         8388608 |                                         0
ZFS Write Throttle
• Keep transactions to a reasonable size – limit outstanding data
• Target a fixed time (1-5 seconds on most systems)
• Figure out how much we can write in that time
• Don’t accept more than that amount of data in a txg
• When we get to 7/8ths of the limit, insert a 10ms delay
ZFS Write Throttle
• Keep transactions to a reasonable size – limit outstanding data
• Target a fixed time (1-5 seconds on most systems)
• Figure out how much we can write in that time
• Don’t accept more than that amount of data in a txg
• When we get to 7/8ths of the limit, insert a 10ms delay

WTF!?
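The delay above amounts to a simple check on every write entering the open txg. A minimal sketch in C of that legacy behavior, assuming a per-pool byte limit recomputed each txg; the names (pool_throttle_t, write_limit, dirty_bytes, throttle_write) are illustrative, not the actual ZFS identifiers:

#include <stdint.h>
#include <unistd.h>

/* Illustrative per-pool state; not the real ZFS structures. */
typedef struct pool_throttle {
	uint64_t write_limit;	/* bytes we expect to be able to sync in the target time */
	uint64_t dirty_bytes;	/* data already accepted into the open txg */
} pool_throttle_t;

/* Called for every write entering the open transaction group. */
static void
throttle_write(pool_throttle_t *pt, uint64_t nbytes)
{
	/*
	 * Past 7/8ths of the limit: penalize every writer with the same
	 * fixed 10ms delay, no matter how far over the threshold we are.
	 */
	if (pt->dirty_bytes > pt->write_limit / 8 * 7)
		usleep(10 * 1000);

	/* At the limit, writers must wait for the syncing txg (not shown). */
	pt->dirty_bytes += nbytes;
}

The limit itself is recomputed from what we think we can write in the target time, so it drifts with the workload – the second-by-second trace a few slides later shows it wandering between roughly 100 and 800 MB.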
7/8ths full delaying for 10ms
async write

   microseconds
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |@@@@@@@@@@@@@                            1549
              64 |@@@@@@@@@@@                              1306
             128 |@@@@@@@@@                                1049
             256 |@@                                       192
             512 |                                         34
            1024 |                                         23
            2048 |                                         47
            4096 |@                                        63
            8192 |@                                        153
           16384 |@                                        83
           32768 |                                         11
           65536 |                                         5
          131072 |                                         4
          262144 |                                         3
          524288 |@                                        102
         1048576 |@                                        106
         2097152 |@                                        69
         4194304 |                                         0
Observing the write throttle limit (second-by-second)
# dtrace -n 'BEGIN{ start = timestamp; } fbt::dsl_pool_sync:entry/stringof(args[0]->dp_spa->spa_name) == "domain0"/{
@[(timestamp - start) / 1000000000] = min(args[0]->dp_write_limit / 1000000); }' -xaggsortkey -c 'sleep 600'
dtrace: description 'BEGIN' matched 2 probes
…
  second   write limit (MB)
       9   470
      10   470
      11   487
      14   487
      15   515
      16   515
      17   557
      18   581
      19   581
      20   617
      21   617
      22   635
      23   663
      24   663
      25   673

Saw anywhere from 100 – 800 MB!
Second Problem: IO Queuing
Check out IO queue times
write sync

   microseconds
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |                                         2
               2 |@@@@@@@                                  51
               4 |@@@@@@                                   43
               8 |@                                        5
              16 |                                         3
              32 |@                                        6
              64 |@                                        10
             128 |@@                                       13
             256 |@@                                       18
             512 |@@@@@                                    38
            1024 |@@@@@@                                   44
            2048 |@@@@@                                    37
            4096 |@@@                                      24
            8192 |@                                        9
           16384 |                                         0
IO times with queue depth 10 (default)
write

   microseconds
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         70
              64 |                                         170
             128 |                                         130
             256 |@@                                       1143
             512 |@@@                                      1762
            1024 |@@@@                                     2417
            2048 |@@@@@@@                                  4135
            4096 |@@@@@@@@                                 4816
            8192 |@@@@@@@                                  4132
           16384 |@@@@                                     2370
           32768 |@@@                                      1456
           65536 |                                         148
          131072 |                                         8
          262144 |                                         0
IO times with queue depth 20
write

   microseconds
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         43
              64 |                                         137
             128 |@                                        243
             256 |@@@@@                                    2233
             512 |@@@@@                                    2238
            1024 |@@@@                                     1968
            2048 |@@@@@                                    2395
            4096 |@@@@@@                                   2660
            8192 |@@@@@@                                   2829
           16384 |@@@@@                                    2499
           32768 |@@@                                      1466
           65536 |@                                        296
          131072 |                                         0
IO times with queue depth 30
write

   microseconds
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         82
              64 |                                         137
             128 |                                         230
             256 |@@@@                                     2195
             512 |@@@@                                     2589
            1024 |@@@@                                     2416
            2048 |@@@@@                                    2844
            4096 |@@@@@@                                   3330
            8192 |@@@@@@                                   3794
           16384 |@@@@@@                                   3306
           32768 |@@@                                      2008
           65536 |@                                        443
          131072 |                                         1
          262144 |                                         0
IO times with queue depth 64
write

   microseconds
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         345
              64 |@                                        697
             128 |                                         169
             256 |                                         60
             512 |                                         380
            1024 |@                                        1084
            2048 |@                                        1562
            4096 |@                                        1819
            8192 |@@@@                                     4974
           16384 |@@@@@@@@@                                10683
           32768 |@@@@@@@@@@@@@                            15637
           65536 |@@@@@@@@@                                10608
          131072 |@                                        1050
          262144 |                                         0

          avg latency    iops     throughput
write     44557us        817/s    30300k/s
IO times with queue depth 128
write

   microseconds
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |                                         330
              64 |@                                        665
             128 |                                         228
             256 |                                         203
             512 |@                                        552
            1024 |@                                        1135
            2048 |@                                        1458
            4096 |@                                        1434
            8192 |@@                                       2049
           16384 |@@@@                                     4070
           32768 |@@@@@@@                                  7936
           65536 |@@@@@@@@@@@                              11269
          131072 |@@@@@@@@@                                9737
          262144 |@                                        1282
          524288 |                                         0

          avg latency    iops     throughput
write     88774us        705/s    38303k/s
IO Problems
• The choice of IO queue depth was crucial
– Where did the default of 10 come from?!
– Balance between latency and throughput

• Shared IO queue for reads and writes
– Maybe this makes sense for disks… maybe…

• The wrong queue depth caused massive queuing within ZFS
– “What do you mean my SAN is slow? It looks great to me!”
New IO Scheduler
• Choose a limit on the “dirty” (modified) data on the system
• As more accumulates, schedule more concurrent IOs
• Limits per IO type
• If we still can’t keep up, start to limit the rate of incoming data

• Chose defaults as close to the old behavior as possible
• Much more straightforward to measure and tune
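A rough sketch of the scheduling idea in C. This is not the OpenZFS implementation; the names and the linear ramp are illustrative assumptions, but they capture the shape: concurrency for each IO type scales with how much of the dirty-data limit is in use.

#include <stdint.h>

/* Illustrative limits for one IO type (e.g. async write); not real ZFS tunables. */
typedef struct io_class {
	uint32_t min_active;	/* IOs kept in flight when the pool is nearly clean */
	uint32_t max_active;	/* IOs kept in flight when dirty data hits the limit */
} io_class_t;

/*
 * How many IOs of this class to keep in flight, given the current amount of
 * dirty data: a simple linear ramp between the class's min and max.
 */
static uint32_t
ios_in_flight(const io_class_t *ic, uint64_t dirty, uint64_t dirty_max)
{
	if (dirty >= dirty_max)
		return (ic->max_active);
	return (ic->min_active +
	    (uint32_t)((ic->max_active - ic->min_active) * dirty / dirty_max));
}

Only when dirty data keeps growing despite the extra concurrency does the scheduler start limiting the rate of incoming writes, which replaces the old all-or-nothing 10ms delay.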
Third Problem: Lock Contention
Looking at lockstat(1M) (1/3)
Count   indv  cuml  rcnt   nsec   Lock                 Caller
167980    9%    9%  0.00  61747   0xffffff0d4aaa4818   taskq_thread+0x2a8

      nsec ------ Time Distribution ------ count   Stack
       512 |                               3233    thread_start+0x8
      1024 |@                              10651
      2048 |@@@@                           26537
      4096 |@@@@@@@@@@                     56854
      8192 |@@@@@                          29262
     16384 |@                              10577
     32768 |@                              5703
     65536 |                               5053
    131072 |                               3555
    262144 |                               5272
    524288 |                               5400
   1048576 |                               4186
   2097152 |                               1487
   4194304 |                               163
   8388608 |                               17
  16777216 |                               21
  33554432 |                               7
  67108864 |                               2
Looking at lockstat(1M) (2/3)
Count   indv  cuml  rcnt   nsec   Lock                 Caller
166416    8%   17%  0.00  88424   0xffffff0d4aaa4818   cv_wait+0x69

      nsec ------ Time Distribution ------ count   Stack
       512 |@                              7775    taskq_thread_wait+0x84
      1024 |@@                             14577   taskq_thread+0x308
      2048 |@@@@@                          31499   thread_start+0x8
      4096 |@@@@@@                         36522
      8192 |@@@                            19818
     16384 |@                              11065
     32768 |@                              7302
     65536 |@                              7932
    131072 |                               5537
    262144 |@                              7992
    524288 |@                              8003
   1048576 |@                              6017
   2097152 |                               2086
   4194304 |                               198
   8388608 |                               48
  16777216 |                               37
  33554432 |                               7
  67108864 |                               1
Looking at lockstat(1M) (3/3)
Count   indv  cuml  rcnt   nsec   Lock                 Caller
136877    7%   24%  0.00  19897   0xffffff0d4aaa4818   taskq_dispatch_ent+0x4a

      nsec ------ Time Distribution ------ count   Stack
       512 |                               1798    zio_taskq_dispatch+0xb5
      1024 |                               1575    zio_issue_async+0x19
      2048 |@                              5593    zio_execute+0x8d
      4096 |@@@@@@@@@@@@@                  61337
      8192 |@@@@                           19408
     16384 |@@@                            15724
     32768 |@@@                            13923
     65536 |@@                             9733
    131072 |                               3564
    262144 |                               3171
    524288 |                               947
   1048576 |                               84
   2097152 |                               1
   4194304 |                               0
   8388608 |                               15
  16777216 |                               1
  33554432 |                               2
  67108864 |                               1
Name that lock!
> 0xffffff0d4aaa4818::whatis
ffffff0d4aaa4818 is ffffff0d4aaa47fc+20, allocated from taskq_cache
> 0xffffff0d4aaa4818-20::taskq
ADDR             NAME             ACT/THDS  Q'ED   MAXQ  INST
ffffff0d4aaa47fc zio_write_issue     0/ 24     0  26977     -
Lock Breakup
• Broke up the taskq lock for write_issue
• Added multiple taskqs, randomly assigned
• Recently hit a similar problem for read_interrupt
• Same solution

• Worth investigating taskq stats
• A dynamic taskq might be an interesting experiment

• Other lock contention issues resolved
• Still more need additional attention
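A minimal sketch of the breakup in C, assuming a generic taskq-style interface; the tq_t type, the dispatch helper, and the array size are illustrative stand-ins, not the OpenZFS code:

#include <stdlib.h>

/* Stand-in for a kernel taskq; only what the sketch needs. */
typedef struct tq tq_t;
extern void tq_dispatch(tq_t *tq, void (*func)(void *), void *arg);

/*
 * Before: one zio_write_issue taskq, so one dispatch lock everyone fights over.
 * After: several taskqs; each dispatch picks one at random, spreading the
 * contention across N locks at the cost of slightly weaker ordering.
 */
#define	N_WRITE_ISSUE_TQS	8	/* illustrative count */
static tq_t *write_issue_tqs[N_WRITE_ISSUE_TQS];

static void
write_issue_dispatch(void (*func)(void *), void *arg)
{
	tq_dispatch(write_issue_tqs[rand() % N_WRITE_ISSUE_TQS], func, arg);
}

Spreading dispatch across several queues removes the single hot lock that accounted for roughly a quarter of all lock contention (and a MAXQ of ~27,000) in the lockstat and ::taskq output above.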
Last Problem: Spacemap Shenanigans
Where does spa_sync() spend its time?
…
dsl_pool_sync_done            16us  (  0%)
spa_config_exit               19us  (  0%)
zio_root                      20us  (  0%)
spa_config_enter              23us  (  0%)
spa_errlog_sync               45us  (  0%)
spa_update_dspace             49us  (  0%)
zio_wait                      53us  (  0%)
dmu_objset_is_dirty           66us  (  0%)
spa_sync_config_object        75us  (  0%)
spa_sync_aux_dev              79us  (  0%)
list_is_empty                 86us  (  0%)
dsl_scan_sync                124us  (  0%)
ddt_sync                     201us  (  0%)
txg_list_remove              519us  (  0%)
vdev_config_sync            1830us  (  0%)
bpobj_iterate               9939us  (  0%)
vdev_sync                  27907us  (  1%)
bplist_iterate             35301us  (  1%)
vdev_sync_done            346336us  ( 16%)
dsl_pool_sync            1652050us  ( 79%)
spa_sync                 2077646us  (100%)
Where does spa_sync() spend its time?
…
dsl_pool_sync_done            16us  (  0%)
spa_config_exit               19us  (  0%)
zio_root                      20us  (  0%)
spa_config_enter              23us  (  0%)
spa_errlog_sync               45us  (  0%)
spa_update_dspace             49us  (  0%)
zio_wait                      53us  (  0%)
dmu_objset_is_dirty           66us  (  0%)
spa_sync_config_object        75us  (  0%)
spa_sync_aux_dev              79us  (  0%)
list_is_empty                 86us  (  0%)
dsl_scan_sync                124us  (  0%)
ddt_sync                     201us  (  0%)
txg_list_remove              519us  (  0%)
vdev_config_sync            1830us  (  0%)
bpobj_iterate               9939us  (  0%)
vdev_sync                  27907us  (  1%)
bplist_iterate             35301us  (  1%)
vdev_sync_done            346336us  ( 16%)
dsl_pool_sync            1652050us  ( 79%)   ← This is expected; it means we’re writing
spa_sync                 2077646us  (100%)
Where does spa_sync() spend its time?
…
dsl_pool_sync_done            16us  (  0%)
spa_config_exit               19us  (  0%)
zio_root                      20us  (  0%)
spa_config_enter              23us  (  0%)
spa_errlog_sync               45us  (  0%)
spa_update_dspace             49us  (  0%)
zio_wait                      53us  (  0%)
dmu_objset_is_dirty           66us  (  0%)
spa_sync_config_object        75us  (  0%)
spa_sync_aux_dev              79us  (  0%)
list_is_empty                 86us  (  0%)
dsl_scan_sync                124us  (  0%)
ddt_sync                     201us  (  0%)
txg_list_remove              519us  (  0%)
vdev_config_sync            1830us  (  0%)
bpobj_iterate               9939us  (  0%)
vdev_sync                  27907us  (  1%)
bplist_iterate             35301us  (  1%)
vdev_sync_done            346336us  ( 16%)   ← What’s this?
dsl_pool_sync            1652050us  ( 79%)
spa_sync                 2077646us  (100%)
What’s vdev_sync_done() doing?
txg_list_empty              0us  (  0%)
txg_list_remove            15us  (  0%)
metaslab_sync_done       8681us  ( 90%)
vdev_sync_done           9563us  (100%)
How about metaslab_sync_done()?
vdev_dirty                 3266us
vdev_space_update          5333us
space_map_load_wait        5758us
space_map_vacate          30455us
metaslab_weight           54507us
metaslab_group_sort       68445us
space_map_unload        1519906us
metaslab_sync_done      1630626us
What about all space_map_*() functions?
space_map_truncate            33 times        6ms  (  0%)
space_map_load_wait         1721 times        7ms  (  0%)
space_map_sync              3766 times      210ms  (  0%)
space_map_unload             135 times     1268ms  (  0%)
space_map_free             21694 times     4280ms  (  1%)
space_map_vacate            3643 times    45891ms  ( 12%)
space_map_seg_compare   13124822 times    55423ms  ( 14%)
space_map_add             580809 times    79868ms  ( 21%)
space_map_remove          514181 times    81682ms  ( 21%)
space_map_walk              2081 times   120962ms  ( 32%)
spa_sync                       1 times   374818ms  (100%)
How about the CPU performance counters?
# dtrace -n 'cpc:::PAPI_tlb_dm-all-10000{ @[stack()] = count(); }' -n END'{ trunc(@, 20); printa(@); }' -c 'sleep 100'
…
zfs`metaslab_segsize_compare+0x1f
genunix`avl_find+0x52
genunix`avl_add+0x2d
zfs`space_map_remove+0x170
zfs`space_map_alloc+0x47
zfs`metaslab_group_alloc+0x310
zfs`metaslab_alloc_dva+0x2c1
zfs`metaslab_alloc+0x9c
zfs`zio_dva_allocate+0x8a
zfs`zio_execute+0x8d
genunix`taskq_thread+0x285
unix`thread_start+0x8
1550
zfs`lzjb_decompress+0x89
zfs`zio_decompress_data+0x53
zfs`zio_decompress+0x56
zfs`zio_pop_transforms+0x3d
zfs`zio_done+0x26b
zfs`zio_execute+0x8d
zfs`zio_notify_parent+0xa6
zfs`zio_done+0x4ea
zfs`zio_execute+0x8d
zfs`zio_notify_parent+0xa6
Spacemaps and Metaslabs
• Two things going on here:
– 30,000+ segments per spacemap
– Building the perfect spacemap – close enough would work
– Doing a bunch of work that we can clever our way out of

• Still much to be done:
– Why 200 metaslabs per LUN?
– Allocations can still be very painful
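To see why the hundreds of thousands of space_map_add()/space_map_remove() calls shown above hurt, here is a toy version of the in-core structure in C, assuming (as the avl_find/avl_add stacks suggest) an ordered segment set with coalescing on insert. The segset_* names and types are illustrative stand-ins, not the ZFS code; the point is that keeping the spacemap “perfect” costs a couple of O(log n) tree operations for every range freed or allocated, in trees holding 30,000+ segments.

#include <stdint.h>
#include <stddef.h>

/* A free-space segment covering [start, end). */
typedef struct seg {
	uint64_t start;
	uint64_t end;
} seg_t;

/* Stand-ins for an ordered segment set (the real code uses AVL trees). */
typedef struct segset segset_t;
extern seg_t *segset_prev(segset_t *ss, uint64_t start);  /* greatest segment below start */
extern seg_t *segset_next(segset_t *ss, uint64_t start);  /* least segment at/above start */
extern void   segset_insert(segset_t *ss, uint64_t start, uint64_t end);
extern void   segset_remove(segset_t *ss, seg_t *s);

/* Add a freed range, merging with neighbors so no two segments ever abut. */
void
segset_add_range(segset_t *ss, uint64_t start, uint64_t end)
{
	seg_t *before = segset_prev(ss, start);
	seg_t *after = segset_next(ss, start);

	if (before != NULL && before->end == start) {	/* coalesce left */
		start = before->start;
		segset_remove(ss, before);
	}
	if (after != NULL && after->start == end) {	/* coalesce right */
		end = after->end;
		segset_remove(ss, after);
	}
	segset_insert(ss, start, end);
}

A “close enough” spacemap – for example, one that batches or defers this kind of coalescing – would skip much of the per-segment work, which is what the “close enough would work” bullet above is getting at.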
The Next Age of OpenZFS
• General purpose and purpose-built OpenZFS products
• Used for varied and demanding uses
• Data-driven discoveries
– Write throttle needed rethinking
– Metaslabs / spacemaps / allocation is fertile ground
– Performance nose-dives around 85% of pool capacity
– Lock contention impacts high-performance workloads

• What’s next?
– More workloads; more data!
– Feedback on recent enhancements
– Connect allocation / scrub to the new IO scheduler
– Consider data-driven, adaptive algorithms within OpenZFS
