2. ZFS Was Slow, Is Faster
Adam Leventhal, CTO Delphix
@ahl
3. My Version of ZFS History
• 2001-2005 The 1st age of ZFS: building the behemoth
– Stability, reliability, features
• 2006-2008 The 2nd age of ZFS: appliance model and open source
– Completing the picture; making it work as advertised; still more features
• 2008-2010 The 3rd age of ZFS: trial by fire
– Stability in the face of real workloads
– Performance in the face of real workloads
4. The 1st Age of OpenZFS
• All the stuff Matt talked about, yes:
– Many platforms
– Many companies
– Many contributors
• Performance analysis on real and varied customer workloads
5. A note about the data
• The data you are about to see is real
• The names have been changed to protect the innocent (and guilty)
• It was mostly collected with DTrace
• We used some other tools as well: lockstat, mpstat
• You might wish I had more / different data – I do too
15. ZFS Write Throttle
• Keep transactions to a reasonable size – limit outstanding data
• Target a fixed time (1-5 seconds on most systems)
• Figure out how much we can write in that time
• Don’t accept more than that amount of data in a txg
• When we get to 7/8ths of the limit, insert a 10ms delay
16. ZFS Write Throttle
• Keep transactions to a reasonable size – limit outstanding data
• Target a fixed time (1-5 seconds on most systems)
• Figure out how much we can write in that time
• Don’t accept more than that amount of data in a txg
• When we get to 7/8ths of the limit, insert a 10ms delay
WTF!?
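To see why this drew a WTF, here is a minimal sketch of the throttle the bullets above describe. The identifiers (write_limit, dirty_in_txg, TXG_TARGET_SECONDS) are invented for illustration and this is not the actual ZFS code, but the control flow follows the slide: estimate how much data fits in a txg, then bolt a fixed 10ms delay onto every write once the open txg reaches 7/8ths of that estimate.

/*
 * Illustrative sketch of the legacy write throttle; names are made up
 * for the example and do not match the real ZFS source.
 */
#include <stdint.h>
#include <unistd.h>

#define TXG_TARGET_SECONDS  5        /* aim to sync a txg in ~1-5 seconds */
#define THROTTLE_DELAY_US   10000    /* the fixed 10ms stall              */

static uint64_t write_limit;         /* bytes we believe fit in one txg   */
static uint64_t dirty_in_txg;        /* bytes accepted into the open txg  */

/* After each sync, re-estimate how much we can write in the target time. */
static void
recompute_write_limit(uint64_t bytes_synced, uint64_t sync_seconds)
{
    uint64_t throughput = bytes_synced / (sync_seconds ? sync_seconds : 1);
    write_limit = throughput * TXG_TARGET_SECONDS;
}

/* Called as each write enters the open txg. */
static void
throttle_write(uint64_t nbytes)
{
    /* At 7/8ths of the limit, every writer eats the same 10ms delay. */
    if (dirty_in_txg + nbytes > (write_limit / 8) * 7)
        usleep(THROTTLE_DELAY_US);
    dirty_in_txg += nbytes;      /* past the limit, writes wait for the next txg */
}

int
main(void)
{
    recompute_write_limit(800ULL << 20, 2);  /* say, 800 MB synced in 2s */
    for (int i = 0; i < 100; i++)
        throttle_write(32ULL << 20);         /* 32 MB writes pour in     */
    return 0;
}

The problem is the cliff: below 7/8ths writers see no pushback at all, above it every writer pays the identical 10ms no matter how far behind the pool actually is, and in this sketch (as in the description above) the limit is only re-estimated once per sync.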
26. IO Problems
• The choice of IO queue depth was crucial
– Where did the default of 10 come from?!
– Balance between latency and throughput
• Shared IO queue for reads and writes
– Maybe this makes sense for disks… maybe…
• The wrong queue depth caused massive queuing within ZFS
– “What do you mean my SAN is slow? It looks great to me!”
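A toy model of that failure mode, with the caveat that this is a simplification and not the actual vdev queue code: one queue per device, ten slots shared by reads and writes, and anything that doesn't fit waits inside ZFS where the storage never sees it.

/* Toy model of a shared, fixed-depth per-device IO queue. */
#include <stdio.h>

#define VDEV_MAX_PENDING 10                  /* the magic default depth */

enum io_type { IO_READ, IO_WRITE };
struct io { enum io_type type; int id; };

static struct io pending[VDEV_MAX_PENDING];  /* issued to the device    */
static int       npending;
static struct io waiting[1024];              /* queued inside ZFS       */
static int       nwaiting;

/* Reads and writes compete for the same ten slots. */
static void
vdev_queue_io(struct io req)
{
    if (npending < VDEV_MAX_PENDING)
        pending[npending++] = req;
    else
        waiting[nwaiting++] = req;           /* invisible to the SAN    */
}

int
main(void)
{
    /* A burst of writes fills every slot... */
    for (int i = 0; i < 12; i++)
        vdev_queue_io((struct io){ IO_WRITE, i });

    /* ...so this read queues inside ZFS while the device looks idle. */
    vdev_queue_io((struct io){ IO_READ, 99 });

    printf("at the device: %d, stuck in ZFS: %d\n", npending, nwaiting);
    return 0;
}

With the depth set wrong for the backend, the waiting list grows while the device-side slots stay comfortably full, which is exactly the "my SAN looks great" symptom.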
27. New IO Scheduler
• Choose a limit on the “dirty” (modified) data on the system
• As more accumulates, schedule more concurrent IOs
• Limits per IO type
• If we still can’t keep up, start to limit the rate of incoming data
• Chose defaults as close to the old behavior as possible
• Much more straightforward to measure and tune
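A rough sketch of that idea in C. The constants and the quadratic delay curve are placeholders chosen for illustration; the actual OpenZFS tunables and delay function differ, but the shape is the same: concurrency grows with dirty data, and only when that isn't enough do incoming writes get delayed, gradually rather than with a 10ms cliff.

/* Sketch of the new scheduler's core idea; values are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define DIRTY_DATA_MAX   (4ULL << 30)   /* cap on modified data, e.g. 4 GiB   */
#define WRITE_MIN_ACTIVE 1              /* per-IO-type concurrency bounds     */
#define WRITE_MAX_ACTIVE 10
#define DELAY_PCT        60             /* % dirty at which writers slow down */

/* More dirty data -> more concurrent write IOs, interpolated linearly. */
static int
write_ios_to_schedule(uint64_t dirty)
{
    uint64_t pct = dirty * 100 / DIRTY_DATA_MAX;
    return WRITE_MIN_ACTIVE +
        (int)((WRITE_MAX_ACTIVE - WRITE_MIN_ACTIVE) * pct / 100);
}

/* If the pool still can't keep up, push back on incoming writes with a
 * delay that grows gradually instead of a fixed 10ms cliff. */
static uint64_t
incoming_write_delay_us(uint64_t dirty)
{
    uint64_t pct = dirty * 100 / DIRTY_DATA_MAX;
    if (pct < DELAY_PCT)
        return 0;
    return (pct - DELAY_PCT) * (pct - DELAY_PCT);
}

int
main(void)
{
    for (uint64_t pct = 0; pct <= 100; pct += 20) {
        uint64_t dirty = DIRTY_DATA_MAX * pct / 100;
        printf("%3llu%% dirty: %2d concurrent writes, %llu us delay\n",
            (unsigned long long)pct, write_ios_to_schedule(dirty),
            (unsigned long long)incoming_write_delay_us(dirty));
    }
    return 0;
}

The behavior is a smooth function of one observable quantity, the amount of dirty data, which is what makes it much more straightforward to measure and tune than the old 7/8ths cliff.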
32. Name that lock!
> 0xffffff0d4aaa4818::whatis
ffffff0d4aaa4818 is ffffff0d4aaa47fc+20, allocated from taskq_cache
> 0xffffff0d4aaa4818-20::taskq
ADDR             NAME             ACT/THDS  Q'ED  MAXQ  INST
ffffff0d4aaa47fc zio_write_issue     0/ 24     0 26977     -
33. Lock Breakup
• Broke up the taskq lock for write_issue
• Added multiple taskqs, randomly assigned
• Recently hit a similar problem for read_interrupt
• Same solution
• Worth investigating taskq stats
• A dynamic taskq might be an interesting experiment
• Other lock contention issues resolved
• Still more need additional attention
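The shape of the fix, sketched with invented names (N_WRITE_ISSUE_TQS, dispatch_write_issue); the real change is in the ZFS spa taskq setup, but the pattern is just this: several taskqs, several locks, and a random pick per dispatch so writers stop serializing on one mutex.

/* Sketch of breaking one hot dispatch lock into several. */
#include <pthread.h>
#include <stdlib.h>

#define N_WRITE_ISSUE_TQS 8             /* several queues instead of one */

typedef struct taskq {
    pthread_mutex_t tq_lock;            /* contention now splits 8 ways  */
    /* ... queue of pending tasks ... */
} taskq_t;

static taskq_t write_issue_tqs[N_WRITE_ISSUE_TQS];

static void
dispatch_write_issue(void (*func)(void *), void *arg)
{
    /* Random assignment spreads dispatchers across the locks. */
    taskq_t *tq = &write_issue_tqs[rand() % N_WRITE_ISSUE_TQS];

    pthread_mutex_lock(&tq->tq_lock);
    /* ... enqueue (func, arg) on tq here ... */
    (void)func; (void)arg;
    pthread_mutex_unlock(&tq->tq_lock);
}

int
main(void)
{
    for (int i = 0; i < N_WRITE_ISSUE_TQS; i++)
        pthread_mutex_init(&write_issue_tqs[i].tq_lock, NULL);
    dispatch_write_issue(NULL, NULL);
    return 0;
}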
40. What about all space_map_*() functions?
space_map_truncate             33 times        6ms   ( 0%)
space_map_load_wait          1721 times        7ms   ( 0%)
space_map_sync               3766 times      210ms   ( 0%)
space_map_unload              135 times     1268ms   ( 0%)
space_map_free              21694 times     4280ms   ( 1%)
space_map_vacate             3643 times    45891ms   (12%)
space_map_seg_compare    13124822 times    55423ms   (14%)
space_map_add              580809 times    79868ms   (21%)
space_map_remove           514181 times    81682ms   (21%)
space_map_walk               2081 times   120962ms   (32%)
spa_sync                        1 times   374818ms  (100%)
42. Spacemaps and Metaslabs
• Two things going on here:
– 30,000+ segments per spacemap
– Building the perfect spacemap – close enough would work
– Doing a bunch of work that we can clever our way out of
• Still much to be done:
– Why 200 metaslabs per LUN?
– Allocations can still be very painful
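To make the seg_compare and add/remove numbers from the earlier table concrete, here is a standalone toy, not ZFS code: free segments kept in offset order with a comparator in the spirit of space_map_seg_compare. The comparison count grows with the segment count, so 30,000+ segments per spacemap, multiplied across many metaslabs, is the kind of cost those table rows are showing.

/* Toy model: an offset-ordered set of free segments, ZFS-style. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct seg {
    uint64_t start;                     /* offset of the free segment */
    uint64_t size;                      /* length of the free segment */
} seg_t;

/* Comparator in the spirit of space_map_seg_compare: order by offset. */
static int
seg_compare(const void *a, const void *b)
{
    const seg_t *sa = a, *sb = b;
    if (sa->start < sb->start)
        return (-1);
    if (sa->start > sb->start)
        return (1);
    return (0);
}

int
main(void)
{
    /* A metaslab's spacemap with ~30,000 free segments... */
    enum { NSEGS = 30000 };
    seg_t *segs = malloc(NSEGS * sizeof (seg_t));

    for (int i = 0; i < NSEGS; i++)
        segs[i] = (seg_t){ .start = (uint64_t)rand() << 12, .size = 4096 };

    /*
     * ...means every pass that keeps the set ordered pays a comparison
     * per segment visited: O(n log n) here, and in ZFS once per
     * allocation, free, and sync, across ~200 metaslabs per LUN.
     */
    qsort(segs, NSEGS, sizeof (seg_t), seg_compare);
    printf("lowest free segment at offset %llu\n",
        (unsigned long long)segs[0].start);

    free(segs);
    return 0;
}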
43. The Next Age of OpenZFS
• General purpose and purpose-built OpenZFS products
• Used for varied and demanding workloads
• Data-driven discoveries
– Write throttle needed rethinking
– Metaslabs / spacemaps / allocation is fertile ground
– Performance nose-dives around 85% of pool capacity
– Lock contention impacts high-performance workloads
• What’s next?
– More workloads; more data!
– Feedback on recent enhancements
– Connect allocation / scrub to the new IO scheduler
– Consider data-driven, adaptive algorithms within OpenZFS