SlideShare una empresa de Scribd logo
1 de 25
Descargar para leer sin conexión
A Tale of a Pathological Storage Workload
Eric Sproul, Circonus
ZFS User Conference
March 2017
Fragging Rights
Fragging Rights | Eric Sproul | March 17, 2017
Circonus Release Engineer
Wearer of other hats as well (ops, sales eng., support)
ZFS user since Solaris 10u2 (>10 years ago now??)
Helped bring OmniOS to the world @ OmniTI
Eric Sproul
@eirescot
esproul
• How ZFS manages free space
• Our workload
• Problem manifesta=on
• Lessons learned and future direc=on
We’ll talk about free space, COW, and the hole we dug for ourselves:
Fragging Rights | Eric Sproul | March 17, 2017
ZFS handled it well… unFl it didn’t.
A Tale of Surprise and Pain
Fragging Rights | Eric Sproul | March 17, 2017
ZFS tracks free space with space maps
Time-ordered log of allocaNons and frees
Top-level vdevs divided into a few hundred metaslabs,
each with its own space map
At metaslab load, the map is read into AVL tree in memory
Free Space in ZFS
For details, see Jeff Bonwick’s excellent explanaFon of space maps:

hOp://web.archive.org/web/20080616104156/hOp://blogs.sun.com/bonwick/entry/space_maps
Fragging Rights | Eric Sproul | March 17, 2017
Never overwrite exisNng data.
“UpdaNng a file” means new bits
wriXen to previously free space,
followed by freeing of old chunk.
Copy-on-Write
Scalable =me-series data store
Compact on-disk format while not sacrificing granularity
Enable rich analysis of historical telemetry data
Store value plus 6 other pieces of info:
• variance of value (stddev)
• deriva=ve (change over =me) and stddev of deriva=ve
• counter (non-nega=ve deriva=ve) and stddev of counter
• count of samples that were averaged to produce value
Fragging Rights | Eric Sproul | March 17, 2017
“It seemed like a good idea at the Fme…”
The Workload
Fragging Rights | Eric Sproul | March 17, 2017
File format is columnar and offset-based, with 32-byte records.
Common case is appending to the end (most recent measurement/rollup).
No synchronous semanNcs used for updaNng data files
(this is a cluster and we have write-ahead logs).
5x 1m records = 32 * 5 = 160 bytes append
1x 5m rollup = 32 bytes append
1x 3h rollup = 32 bytes append, then update w/ recalculated rollup
The Workload
Fragging Rights | Eric Sproul | March 17, 2017
With copy-on-write, every Circonus record append or overwrite modifies a ZFS
block, freeing the old copy of that block.
ZFS has variable block sizes: minimum is a single disk sector (512b/4K),
max is recordsize property (we used 8K).
When the tail block reaches 8K, it stops gedng updated
and a new tail block starts.
The Workload
Fragging Rights | Eric Sproul | March 17, 2017
ZFS sees tail block updated every 5 minutes for:
1m files: 8192/160 = 51 Nmes (~4 hours)
5m files: 8192/32 = 256 Nmes (~21 hours)
3h files: tragic
• last record wriXen, then rewriXen 35 more Nmes

(as 3h rollup value gets recalculated)
• 256 records to fill 8K, ZFS sees 256*36 = 9216 block updates (32 days)
Alas, we also use compression, so it’s actually ~2x worse than this.
The Workload
Fragging Rights | Eric Sproul | March 17, 2017
Aher “some Nme”, depending on ingesNon volume and pool/vdev size,
performance degrades swihly.
TXGs take longer and longer to process, stalling the ZIO pipeline.
ZIO latency bubbles up to userland; applicaNon sees
increasing write syscall latency, lowering throughput.
Customers are sad. 😔
The Problem
Fragging Rights | Eric Sproul | March 17, 2017
Start with what the app sees: DTrace syscalls
(us) syscall: pwrite bytes: 32
value ------------- Distribution ------------- count
4 | 0
8 | 878
16 |@@@@@@@@@@@@ 27051
32 |@@@@@@@@@@@@@@@ 34310
64 |@@@@@@ 13361
128 |@ 2586
256 | 148
512 | 53
1024 | 39
2048 | 33
4096 | 82
8192 | 534
16384 |@@@@ 8614
32768 | 474
65536 | 22
131072 | 0
262144 | 0
524288 | 0
1048576 | 36
2097152 | 335
4194304 | 72
8388608 | 0
Troubleshoo=ng
lolwut?!?
bad
App
VFS
ZFS
(disks)
userland
kernel
syscall
Fragging Rights | Eric Sproul | March 17, 2017
Slow writes must mean disks are saturated, right?
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
data 918.3 2638.5 4644.3 32217.0 1362.7 11.8 386.4 21 40
rpool 0.9 4.4 3.5 22.3 0.0 0.0 2.0 0 0
sd0 0.9 4.4 3.5 22.3 0.0 0.0 0.5 0 0
sd6 67.6 175.1 347.9 2299.8 0.0 0.6 2.6 0 13
sd7 64.4 225.5 335.9 2300.7 0.0 0.7 2.3 0 14
sd8 67.3 167.8 314.5 2300.4 0.0 0.6 2.6 0 13
sd9 65.3 173.8 326.7 2299.8 0.0 0.6 2.6 0 13
sd10 66.1 226.6 332.2 2300.7 0.0 0.6 2.2 0 13
sd11 67.2 153.9 338.8 2301.4 0.0 0.4 2.0 0 11
sd12 69.4 154.6 345.5 2301.4 0.0 0.4 2.0 0 11
sd13 64.0 162.0 321.9 2300.8 0.0 0.4 2.0 0 11
sd14 65.5 163.7 328.7 2300.8 0.0 0.4 2.0 0 11
sd15 64.5 221.4 343.9 2303.5 0.0 0.8 2.7 0 15
sd16 61.1 222.6 318.1 2303.5 0.0 0.8 2.8 0 15
sd17 63.5 211.3 338.1 2303.2 0.0 0.7 2.7 0 14
sd18 63.8 213.0 330.6 2303.2 0.0 0.7 2.6 0 14
sd19 68.7 170.0 321.9 2300.4 0.0 0.6 2.5 0 13
Troubleshoo=ng
App
VFS
ZFS
(disks)
userland
kernel
iostat -x
Fragging Rights | Eric Sproul | March 17, 2017
We know the problem is in the write path, but it's not the disks.
What Now?
TL;DR is that ZFS internals are very complicated and difficult to reason about, especially with
legacy ZFS code (OmniOS r151006, circa 2013).
We flailed around for a while, using DTrace to generate kernel flame graphs,
to get a sense of what the kernel was up to.
Fragging Rights | Eric Sproul | March 17, 2017
Flame graphs!
Kernel Stack Profiling
System mostly idle
(not shown),
but when not, it's ZFS.
hXps://github.com/brendangregg/FlameGraph
Fragging Rights | Eric Sproul | March 17, 2017
>1s to find free space?!?
metaslab_alloc
time (nsec)
value ------------- Distribution ------------- count
1024 | 0
2048 | 1166
4096 |@@@@@ 41714
8192 |@@@@@@@@@@@@@@@@ 134407
16384 |@@@@@@@@@@@@@ 107140
32768 |@@ 17603
65536 |@ 10066
131072 |@ 10315
262144 |@ 8144
524288 | 3715
1048576 | 1598
2097152 | 581
4194304 | 75
8388608 | 49
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 1276
2147483648 | 0
Slab Alloca=on
App
VFS
ZFS
(disks)
userland
kernel
metaslab_alloc
• pwrite(2) syscall is slow.
• Slow because ZFS takes so long to allocate new space for writes.
• We don't know precisely why these alloca=ons are bogged down,

but it likely involves the nature of the pool's free space.
StarNng with applicaNon-perceived latency:
Fragging Rights | Eric Sproul | March 17, 2017
What do we know now?
Troubleshoo=ng: Recap
Fragging Rights | Eric Sproul | March 17, 2017
Aggregate free space isn't everything
Needed to visualize exisNng metaslab allocaNons:
Quan=ty vs. Quality
# zdb -m data
Metaslabs:
vdev 0
metaslabs 116 offset spacemap free
--------------- ------------------- --------------- -------------
metaslab 0 offset 0 spacemap 38 free 2.35G
metaslab 1 offset 100000000 spacemap 153 free 2.19G
metaslab 2 offset 200000000 spacemap 156 free 1.70G
metaslab 3 offset 300000000 spacemap 158 free 639M
metaslab 4 offset 400000000 spacemap 160 free 1.11G
Fragging Rights | Eric Sproul | March 17, 2017
Visualizing Metaslabs
Credit for the idea: hOps://www.delphix.com/blog/delphix-engineering/zfs-write-performance-impact-fragmentaFon
Color indicates fullness;
Green = empty
Red = full
Percentage is remaining
free space.
Fragging Rights | Eric Sproul | March 17, 2017
Visualizing Metaslabs
Aher data rewrite,
allocaNons are Nghter.
Performance problem
is gone.
Did we just create "ZFS Defrag"?
Fragging Rights | Eric Sproul | March 17, 2017
From one slab, on one vdev, using `zdb -mmm`:
Spacemap Detail
Metaslabs:
vdev 0
metaslabs 116 offset spacemap free
--------------- ------------------- --------------- -------------
metaslab 39 offset 9c00000000 spacemap 357 free 13.2G
segments 1310739 maxsize 168K freepct 82%
[ 0] ALLOC: txg 7900863, pass 1
[ 1] A range: 9c00000000-9c00938800 size: 938800
[ 2] A range: 9c00939000-9c0093d200 size: 004200
...
[4041229] FREE: txg 7974611, pass 1
[4041230] F range: 9d79d10600-9d79d10800 size: 000200
[4041231] F range: 9e84efe400-9e84efe600 size: 000200
[4041232] FREE: txg 7974612, pass 1
[4041233] F range: 9e72ba4600-9e72ba5400 size: 000e00
90th %ile freed size: ~9K
31% of frees are 512b-1K
>4M records in one spacemap...

Costly to load, only to discover you can't allocate from it!

Probably contributes to those long metaslab_alloc() =mes.
• Spacemap histograms
• Visible via zdb (-mm) and mdb (::spa -mh)
• Metaslab fragmenta=on metrics
• Allocator changes to account for fragmenta=on
• New tuning knobs* for write thro_le
Since our iniNal foray into this issue, new features have come out:
Fragging Rights | Eric Sproul | March 17, 2017
We weren't alone!
OpenZFS Improvements
* hOp://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/
Fragging Rights | Eric Sproul | March 17, 2017
metaslab 2 offset 400000000 spacemap 227 free 7.99G
On-disk histogram: fragmentation 90
9: 632678 ****************************************
10: 198275 *************
11: 342565 **********************
12: 460625 ******************************
13: 213397 **************
14: 82860 ******
15: 9774 *
16: 137 *
17: 1 *
Spacemap Histogram
Key is power-of-2 range size
FragmentaNon metric is based on this distribuNon
• Performance is fine un=l some percentage of metaslabs are "spoiled",
even though overall pool used space is low.
• Once in this state, only solu=on is bulk data rewrite.
• Happens sooner if you have fewer/smaller slabs.
• Happens sooner if you increase inges=on rate.
Our workload is (someNmes) pathologically bad at scale
Fragging Rights | Eric Sproul | March 17, 2017
"Doctor, it hurts when I do <this>..."

"Then don't do that."
What We Learned
• Use in-memory DB to accumulate incoming data
• Batch-update columnar files with large, sequen=al writes
• Eventually replace columnar files with some other DB format
Avoid those single-column updates
Fragging Rights | Eric Sproul | March 17, 2017
Follow doctor's orders
What We're Doing About It
QuesFons?
Eric Sproul, Circonus
ZFS User Conference
March 2017
Thanks for listening!
@eirescot

Más contenido relacionado

La actualidad más candente

Large scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in BioinformaticsLarge scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in BioinformaticsNtino Krampis
 
SVs hackathon group report
SVs hackathon group reportSVs hackathon group report
SVs hackathon group reportFritz Sedlazeck
 
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and SparkFOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and SparkRob Emanuele
 
FPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGAFPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGAHiroki Nakahara
 
Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0Simon Elliston Ball
 
LHCb Computing Workshop 2018: PV finding with CNNs
LHCb Computing Workshop 2018: PV finding with CNNsLHCb Computing Workshop 2018: PV finding with CNNs
LHCb Computing Workshop 2018: PV finding with CNNsHenry Schreiner
 
Climate data in r with the raster package
Climate data in r with the raster packageClimate data in r with the raster package
Climate data in r with the raster packageAlberto Labarga
 
Super COMPUTING Journal
Super COMPUTING JournalSuper COMPUTING Journal
Super COMPUTING JournalPandey_G
 
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaA Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaHiroki Nakahara
 
An NSA Big Graph experiment
An NSA Big Graph experimentAn NSA Big Graph experiment
An NSA Big Graph experimentTrieu Nguyen
 
Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiKarel Ha
 
Surface-related multiple elimination through orthogonal encoding in the laten...
Surface-related multiple elimination through orthogonal encoding in the laten...Surface-related multiple elimination through orthogonal encoding in the laten...
Surface-related multiple elimination through orthogonal encoding in the laten...Oleg Ovcharenko
 
Auscert Finding needles in haystacks (the size of countries)
Auscert Finding needles in haystacks (the size of countries)Auscert Finding needles in haystacks (the size of countries)
Auscert Finding needles in haystacks (the size of countries)packetloop
 
Graph Regularised Hashing
Graph Regularised HashingGraph Regularised Hashing
Graph Regularised HashingSean Moran
 
Ch 5: Introduction to heap overflows
Ch 5: Introduction to heap overflowsCh 5: Introduction to heap overflows
Ch 5: Introduction to heap overflowsSam Bowne
 
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachSpark Summit
 
Jonathan Lefman presents his work on Superresolution chemical microscopy
Jonathan Lefman presents his work on Superresolution chemical microscopyJonathan Lefman presents his work on Superresolution chemical microscopy
Jonathan Lefman presents his work on Superresolution chemical microscopyJonathan Lefman
 
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...InfluxData
 
Ai Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen HuangAi Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen HuangNVIDIA Taiwan
 

La actualidad más candente (20)

Large scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in BioinformaticsLarge scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in Bioinformatics
 
SVs hackathon group report
SVs hackathon group reportSVs hackathon group report
SVs hackathon group report
 
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and SparkFOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
 
FPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGAFPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGA
 
Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0
 
C07.heaps
C07.heapsC07.heaps
C07.heaps
 
LHCb Computing Workshop 2018: PV finding with CNNs
LHCb Computing Workshop 2018: PV finding with CNNsLHCb Computing Workshop 2018: PV finding with CNNs
LHCb Computing Workshop 2018: PV finding with CNNs
 
Climate data in r with the raster package
Climate data in r with the raster packageClimate data in r with the raster package
Climate data in r with the raster package
 
Super COMPUTING Journal
Super COMPUTING JournalSuper COMPUTING Journal
Super COMPUTING Journal
 
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaA Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGa
 
An NSA Big Graph experiment
An NSA Big Graph experimentAn NSA Big Graph experiment
An NSA Big Graph experiment
 
Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/Phi
 
Surface-related multiple elimination through orthogonal encoding in the laten...
Surface-related multiple elimination through orthogonal encoding in the laten...Surface-related multiple elimination through orthogonal encoding in the laten...
Surface-related multiple elimination through orthogonal encoding in the laten...
 
Auscert Finding needles in haystacks (the size of countries)
Auscert Finding needles in haystacks (the size of countries)Auscert Finding needles in haystacks (the size of countries)
Auscert Finding needles in haystacks (the size of countries)
 
Graph Regularised Hashing
Graph Regularised HashingGraph Regularised Hashing
Graph Regularised Hashing
 
Ch 5: Introduction to heap overflows
Ch 5: Introduction to heap overflowsCh 5: Introduction to heap overflows
Ch 5: Introduction to heap overflows
 
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
 
Jonathan Lefman presents his work on Superresolution chemical microscopy
Jonathan Lefman presents his work on Superresolution chemical microscopyJonathan Lefman presents his work on Superresolution chemical microscopy
Jonathan Lefman presents his work on Superresolution chemical microscopy
 
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
 
Ai Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen HuangAi Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen Huang
 

Destacado

Startups in Brazil and Latin America - SXSW 2017
Startups in Brazil and Latin America - SXSW 2017Startups in Brazil and Latin America - SXSW 2017
Startups in Brazil and Latin America - SXSW 2017Bruno Peroni
 
Amazon AI (March 2017)
Amazon AI (March 2017)Amazon AI (March 2017)
Amazon AI (March 2017)Julien SIMON
 
ハイブリッドクラウドの現実とAzureの使いどころ
ハイブリッドクラウドの現実とAzureの使いどころハイブリッドクラウドの現実とAzureの使いどころ
ハイブリッドクラウドの現実とAzureの使いどころToru Makabe
 
What is Deep Learning?
What is Deep Learning?What is Deep Learning?
What is Deep Learning?NVIDIA
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with DataSeth Familian
 
Introduction to Docker
Introduction to DockerIntroduction to Docker
Introduction to DockerAdam Štipák
 
Introduction to Domain Driven Design (Webtlak #7)
Introduction to Domain Driven Design (Webtlak #7)Introduction to Domain Driven Design (Webtlak #7)
Introduction to Domain Driven Design (Webtlak #7)Adam Štipák
 
#TorontoHR Meetup: How to speak CEO | TemboStatus
#TorontoHR Meetup:  How to speak CEO | TemboStatus#TorontoHR Meetup:  How to speak CEO | TemboStatus
#TorontoHR Meetup: How to speak CEO | TemboStatusTemboStatus
 
Artificial inteligence
Artificial inteligenceArtificial inteligence
Artificial inteligenceankit dubey
 
MuleSoft London Community - API Marketing, Culture Change and Tooling
MuleSoft London Community - API Marketing, Culture Change and ToolingMuleSoft London Community - API Marketing, Culture Change and Tooling
MuleSoft London Community - API Marketing, Culture Change and ToolingPace Integration
 
Profissões do futuro [ou o futuro das Profissões?]
Profissões do futuro [ou o futuro das Profissões?]Profissões do futuro [ou o futuro das Profissões?]
Profissões do futuro [ou o futuro das Profissões?]Pedro Ramos
 
Web crawl with Elixir
Web crawl with ElixirWeb crawl with Elixir
Web crawl with Elixir이재철
 
Socialytics: Accelerating IBM Connections Adoption with Watson Analytics
Socialytics: Accelerating IBM Connections Adoption with Watson AnalyticsSocialytics: Accelerating IBM Connections Adoption with Watson Analytics
Socialytics: Accelerating IBM Connections Adoption with Watson AnalyticsFemke Goedhart
 
Drive Digital Transformation with Innovation
Drive Digital Transformation with InnovationDrive Digital Transformation with Innovation
Drive Digital Transformation with InnovationPerficient, Inc.
 
Docker for developers
Docker for developersDocker for developers
Docker for developersAnvay Patil
 
Dockercon2015 bamboo
Dockercon2015 bambooDockercon2015 bamboo
Dockercon2015 bambooSteve Smith
 
DevOps and Continuous Delivery reference architectures for Docker
DevOps and Continuous Delivery reference architectures for DockerDevOps and Continuous Delivery reference architectures for Docker
DevOps and Continuous Delivery reference architectures for DockerSonatype
 

Destacado (20)

Startups in Brazil and Latin America - SXSW 2017
Startups in Brazil and Latin America - SXSW 2017Startups in Brazil and Latin America - SXSW 2017
Startups in Brazil and Latin America - SXSW 2017
 
Amazon AI (March 2017)
Amazon AI (March 2017)Amazon AI (March 2017)
Amazon AI (March 2017)
 
ハイブリッドクラウドの現実とAzureの使いどころ
ハイブリッドクラウドの現実とAzureの使いどころハイブリッドクラウドの現実とAzureの使いどころ
ハイブリッドクラウドの現実とAzureの使いどころ
 
What is Deep Learning?
What is Deep Learning?What is Deep Learning?
What is Deep Learning?
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with Data
 
Craftsmanship
CraftsmanshipCraftsmanship
Craftsmanship
 
Introduction to Docker
Introduction to DockerIntroduction to Docker
Introduction to Docker
 
Introduction to Domain Driven Design (Webtlak #7)
Introduction to Domain Driven Design (Webtlak #7)Introduction to Domain Driven Design (Webtlak #7)
Introduction to Domain Driven Design (Webtlak #7)
 
#TorontoHR Meetup: How to speak CEO | TemboStatus
#TorontoHR Meetup:  How to speak CEO | TemboStatus#TorontoHR Meetup:  How to speak CEO | TemboStatus
#TorontoHR Meetup: How to speak CEO | TemboStatus
 
Zfs intro v2
Zfs intro v2Zfs intro v2
Zfs intro v2
 
Artificial inteligence
Artificial inteligenceArtificial inteligence
Artificial inteligence
 
MuleSoft London Community - API Marketing, Culture Change and Tooling
MuleSoft London Community - API Marketing, Culture Change and ToolingMuleSoft London Community - API Marketing, Culture Change and Tooling
MuleSoft London Community - API Marketing, Culture Change and Tooling
 
Profissões do futuro [ou o futuro das Profissões?]
Profissões do futuro [ou o futuro das Profissões?]Profissões do futuro [ou o futuro das Profissões?]
Profissões do futuro [ou o futuro das Profissões?]
 
Web crawl with Elixir
Web crawl with ElixirWeb crawl with Elixir
Web crawl with Elixir
 
Socialytics: Accelerating IBM Connections Adoption with Watson Analytics
Socialytics: Accelerating IBM Connections Adoption with Watson AnalyticsSocialytics: Accelerating IBM Connections Adoption with Watson Analytics
Socialytics: Accelerating IBM Connections Adoption with Watson Analytics
 
Drive Digital Transformation with Innovation
Drive Digital Transformation with InnovationDrive Digital Transformation with Innovation
Drive Digital Transformation with Innovation
 
Docker for developers
Docker for developersDocker for developers
Docker for developers
 
Dockercon2015 bamboo
Dockercon2015 bambooDockercon2015 bamboo
Dockercon2015 bamboo
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
DevOps and Continuous Delivery reference architectures for Docker
DevOps and Continuous Delivery reference architectures for DockerDevOps and Continuous Delivery reference architectures for Docker
DevOps and Continuous Delivery reference architectures for Docker
 

Similar a Fragging Rights: A Tale of a Pathological Storage Workload

Lustre Generational Performance Improvements & New Features
Lustre Generational Performance Improvements & New FeaturesLustre Generational Performance Improvements & New Features
Lustre Generational Performance Improvements & New Featuresinside-BigData.com
 
FAQ on Dedupe NetApp
FAQ on Dedupe NetAppFAQ on Dedupe NetApp
FAQ on Dedupe NetAppAshwin Pawar
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
Data structures assignmentweek4b.pdfCI583 Data Structure
Data structures assignmentweek4b.pdfCI583 Data StructureData structures assignmentweek4b.pdfCI583 Data Structure
Data structures assignmentweek4b.pdfCI583 Data StructureOllieShoresna
 
Distributed and Streaming Graph Processing Techniques
Distributed and Streaming Graph Processing TechniquesDistributed and Streaming Graph Processing Techniques
Distributed and Streaming Graph Processing TechniquesPanagiotis Liakos
 
High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018Zahari Dichev
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?Jeremy Schneider
 
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechGeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechRob Emanuele
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Databricks
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationRob Emanuele
 
Migrating the elastic stack to the cloud, or application logging @ travix
 Migrating the elastic stack to the cloud, or application logging @ travix Migrating the elastic stack to the cloud, or application logging @ travix
Migrating the elastic stack to the cloud, or application logging @ travixRuslan Lutsenko
 
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Daniel Lemire
 
Building maps for apps in the cloud - a Softlayer Use Case
Building maps for  apps in the cloud - a Softlayer Use CaseBuilding maps for  apps in the cloud - a Softlayer Use Case
Building maps for apps in the cloud - a Softlayer Use CaseTiman Rebel
 
Ceph Day Netherlands - Ceph @ BIT
Ceph Day Netherlands - Ceph @ BIT Ceph Day Netherlands - Ceph @ BIT
Ceph Day Netherlands - Ceph @ BIT Ceph Community
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 

Similar a Fragging Rights: A Tale of a Pathological Storage Workload (20)

Lustre Generational Performance Improvements & New Features
Lustre Generational Performance Improvements & New FeaturesLustre Generational Performance Improvements & New Features
Lustre Generational Performance Improvements & New Features
 
FAQ on Dedupe NetApp
FAQ on Dedupe NetAppFAQ on Dedupe NetApp
FAQ on Dedupe NetApp
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Data structures assignmentweek4b.pdfCI583 Data Structure
Data structures assignmentweek4b.pdfCI583 Data StructureData structures assignmentweek4b.pdfCI583 Data Structure
Data structures assignmentweek4b.pdfCI583 Data Structure
 
Distributed and Streaming Graph Processing Techniques
Distributed and Streaming Graph Processing TechniquesDistributed and Streaming Graph Processing Techniques
Distributed and Streaming Graph Processing Techniques
 
High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?
 
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechGeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
 
SIP Tutorial/Workshop 3
SIP Tutorial/Workshop 3SIP Tutorial/Workshop 3
SIP Tutorial/Workshop 3
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis Presentation
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
Migrating the elastic stack to the cloud, or application logging @ travix
 Migrating the elastic stack to the cloud, or application logging @ travix Migrating the elastic stack to the cloud, or application logging @ travix
Migrating the elastic stack to the cloud, or application logging @ travix
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
 
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
 
Building maps for apps in the cloud - a Softlayer Use Case
Building maps for  apps in the cloud - a Softlayer Use CaseBuilding maps for  apps in the cloud - a Softlayer Use Case
Building maps for apps in the cloud - a Softlayer Use Case
 
Ceph Day Netherlands - Ceph @ BIT
Ceph Day Netherlands - Ceph @ BIT Ceph Day Netherlands - Ceph @ BIT
Ceph Day Netherlands - Ceph @ BIT
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
L15.pdf
L15.pdfL15.pdf
L15.pdf
 

Último

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Fragging Rights: A Tale of a Pathological Storage Workload

  • 1. A Tale of a Pathological Storage Workload Eric Sproul, Circonus ZFS User Conference March 2017 Fragging Rights
  • 2. Fragging Rights | Eric Sproul | March 17, 2017 Circonus Release Engineer Wearer of other hats as well (ops, sales eng., support) ZFS user since Solaris 10u2 (>10 years ago now??) Helped bring OmniOS to the world @ OmniTI Eric Sproul @eirescot esproul
  • 3. • How ZFS manages free space • Our workload • Problem manifesta=on • Lessons learned and future direc=on We’ll talk about free space, COW, and the hole we dug for ourselves: Fragging Rights | Eric Sproul | March 17, 2017 ZFS handled it well… unFl it didn’t. A Tale of Surprise and Pain
  • 4. Fragging Rights | Eric Sproul | March 17, 2017 ZFS tracks free space with space maps Time-ordered log of allocaNons and frees Top-level vdevs divided into a few hundred metaslabs, each with its own space map At metaslab load, the map is read into AVL tree in memory Free Space in ZFS For details, see Jeff Bonwick’s excellent explanaFon of space maps:
 hOp://web.archive.org/web/20080616104156/hOp://blogs.sun.com/bonwick/entry/space_maps
  • 5. Fragging Rights | Eric Sproul | March 17, 2017 Never overwrite exisNng data. “UpdaNng a file” means new bits wriXen to previously free space, followed by freeing of old chunk. Copy-on-Write
  • 6. Scalable =me-series data store Compact on-disk format while not sacrificing granularity Enable rich analysis of historical telemetry data Store value plus 6 other pieces of info: • variance of value (stddev) • deriva=ve (change over =me) and stddev of deriva=ve • counter (non-nega=ve deriva=ve) and stddev of counter • count of samples that were averaged to produce value Fragging Rights | Eric Sproul | March 17, 2017 “It seemed like a good idea at the Fme…” The Workload
  • 7. Fragging Rights | Eric Sproul | March 17, 2017 File format is columnar and offset-based, with 32-byte records. Common case is appending to the end (most recent measurement/rollup). No synchronous semanNcs used for updaNng data files (this is a cluster and we have write-ahead logs). 5x 1m records = 32 * 5 = 160 bytes append 1x 5m rollup = 32 bytes append 1x 3h rollup = 32 bytes append, then update w/ recalculated rollup The Workload
  • 8. Fragging Rights | Eric Sproul | March 17, 2017 With copy-on-write, every Circonus record append or overwrite modifies a ZFS block, freeing the old copy of that block. ZFS has variable block sizes: minimum is a single disk sector (512b/4K), max is recordsize property (we used 8K). When the tail block reaches 8K, it stops gedng updated and a new tail block starts. The Workload
  • 9. Fragging Rights | Eric Sproul | March 17, 2017 ZFS sees tail block updated every 5 minutes for: 1m files: 8192/160 = 51 Nmes (~4 hours) 5m files: 8192/32 = 256 Nmes (~21 hours) 3h files: tragic • last record wriXen, then rewriXen 35 more Nmes
 (as 3h rollup value gets recalculated) • 256 records to fill 8K, ZFS sees 256*36 = 9216 block updates (32 days) Alas, we also use compression, so it’s actually ~2x worse than this. The Workload
  • 10. Fragging Rights | Eric Sproul | March 17, 2017 Aher “some Nme”, depending on ingesNon volume and pool/vdev size, performance degrades swihly. TXGs take longer and longer to process, stalling the ZIO pipeline. ZIO latency bubbles up to userland; applicaNon sees increasing write syscall latency, lowering throughput. Customers are sad. 😔 The Problem
  • 11. Fragging Rights | Eric Sproul | March 17, 2017 Start with what the app sees: DTrace syscalls (us) syscall: pwrite bytes: 32 value ------------- Distribution ------------- count 4 | 0 8 | 878 16 |@@@@@@@@@@@@ 27051 32 |@@@@@@@@@@@@@@@ 34310 64 |@@@@@@ 13361 128 |@ 2586 256 | 148 512 | 53 1024 | 39 2048 | 33 4096 | 82 8192 | 534 16384 |@@@@ 8614 32768 | 474 65536 | 22 131072 | 0 262144 | 0 524288 | 0 1048576 | 36 2097152 | 335 4194304 | 72 8388608 | 0 Troubleshoo=ng lolwut?!? bad App VFS ZFS (disks) userland kernel syscall
  • 12. Fragging Rights | Eric Sproul | March 17, 2017 Slow writes must mean disks are saturated, right? extended device statistics device r/s w/s kr/s kw/s wait actv svc_t %w %b data 918.3 2638.5 4644.3 32217.0 1362.7 11.8 386.4 21 40 rpool 0.9 4.4 3.5 22.3 0.0 0.0 2.0 0 0 sd0 0.9 4.4 3.5 22.3 0.0 0.0 0.5 0 0 sd6 67.6 175.1 347.9 2299.8 0.0 0.6 2.6 0 13 sd7 64.4 225.5 335.9 2300.7 0.0 0.7 2.3 0 14 sd8 67.3 167.8 314.5 2300.4 0.0 0.6 2.6 0 13 sd9 65.3 173.8 326.7 2299.8 0.0 0.6 2.6 0 13 sd10 66.1 226.6 332.2 2300.7 0.0 0.6 2.2 0 13 sd11 67.2 153.9 338.8 2301.4 0.0 0.4 2.0 0 11 sd12 69.4 154.6 345.5 2301.4 0.0 0.4 2.0 0 11 sd13 64.0 162.0 321.9 2300.8 0.0 0.4 2.0 0 11 sd14 65.5 163.7 328.7 2300.8 0.0 0.4 2.0 0 11 sd15 64.5 221.4 343.9 2303.5 0.0 0.8 2.7 0 15 sd16 61.1 222.6 318.1 2303.5 0.0 0.8 2.8 0 15 sd17 63.5 211.3 338.1 2303.2 0.0 0.7 2.7 0 14 sd18 63.8 213.0 330.6 2303.2 0.0 0.7 2.6 0 14 sd19 68.7 170.0 321.9 2300.4 0.0 0.6 2.5 0 13 Troubleshoo=ng App VFS ZFS (disks) userland kernel iostat -x
  • 13. Fragging Rights | Eric Sproul | March 17, 2017 We know the problem is in the write path, but it's not the disks. What Now? TL;DR is that ZFS internals are very complicated and difficult to reason about, especially with legacy ZFS code (OmniOS r151006, circa 2013). We flailed around for a while, using DTrace to generate kernel flame graphs, to get a sense of what the kernel was up to.
  • 14. Fragging Rights | Eric Sproul | March 17, 2017 Flame graphs! Kernel Stack Profiling System mostly idle (not shown), but when not, it's ZFS. hXps://github.com/brendangregg/FlameGraph
  • 15. Fragging Rights | Eric Sproul | March 17, 2017 >1s to find free space?!? metaslab_alloc time (nsec) value ------------- Distribution ------------- count 1024 | 0 2048 | 1166 4096 |@@@@@ 41714 8192 |@@@@@@@@@@@@@@@@ 134407 16384 |@@@@@@@@@@@@@ 107140 32768 |@@ 17603 65536 |@ 10066 131072 |@ 10315 262144 |@ 8144 524288 | 3715 1048576 | 1598 2097152 | 581 4194304 | 75 8388608 | 49 16777216 | 0 33554432 | 0 67108864 | 0 134217728 | 0 268435456 | 0 536870912 | 0 1073741824 | 1276 2147483648 | 0 Slab Alloca=on App VFS ZFS (disks) userland kernel metaslab_alloc
  • 16. • pwrite(2) syscall is slow. • Slow because ZFS takes so long to allocate new space for writes. • We don't know precisely why these alloca=ons are bogged down,
 but it likely involves the nature of the pool's free space. StarNng with applicaNon-perceived latency: Fragging Rights | Eric Sproul | March 17, 2017 What do we know now? Troubleshoo=ng: Recap
  • 17. Fragging Rights | Eric Sproul | March 17, 2017 Aggregate free space isn't everything Needed to visualize exisNng metaslab allocaNons: Quan=ty vs. Quality # zdb -m data Metaslabs: vdev 0 metaslabs 116 offset spacemap free --------------- ------------------- --------------- ------------- metaslab 0 offset 0 spacemap 38 free 2.35G metaslab 1 offset 100000000 spacemap 153 free 2.19G metaslab 2 offset 200000000 spacemap 156 free 1.70G metaslab 3 offset 300000000 spacemap 158 free 639M metaslab 4 offset 400000000 spacemap 160 free 1.11G
  • 18. Fragging Rights | Eric Sproul | March 17, 2017 Visualizing Metaslabs Credit for the idea: hOps://www.delphix.com/blog/delphix-engineering/zfs-write-performance-impact-fragmentaFon Color indicates fullness; Green = empty Red = full Percentage is remaining free space.
  • 19. Fragging Rights | Eric Sproul | March 17, 2017 Visualizing Metaslabs Aher data rewrite, allocaNons are Nghter. Performance problem is gone. Did we just create "ZFS Defrag"?
  • 20. Fragging Rights | Eric Sproul | March 17, 2017 From one slab, on one vdev, using `zdb -mmm`: Spacemap Detail Metaslabs: vdev 0 metaslabs 116 offset spacemap free --------------- ------------------- --------------- ------------- metaslab 39 offset 9c00000000 spacemap 357 free 13.2G segments 1310739 maxsize 168K freepct 82% [ 0] ALLOC: txg 7900863, pass 1 [ 1] A range: 9c00000000-9c00938800 size: 938800 [ 2] A range: 9c00939000-9c0093d200 size: 004200 ... [4041229] FREE: txg 7974611, pass 1 [4041230] F range: 9d79d10600-9d79d10800 size: 000200 [4041231] F range: 9e84efe400-9e84efe600 size: 000200 [4041232] FREE: txg 7974612, pass 1 [4041233] F range: 9e72ba4600-9e72ba5400 size: 000e00 90th %ile freed size: ~9K 31% of frees are 512b-1K >4M records in one spacemap...
 Costly to load, only to discover you can't allocate from it!
 Probably contributes to those long metaslab_alloc() =mes.
  • 21. • Spacemap histograms • Visible via zdb (-mm) and mdb (::spa -mh) • Metaslab fragmenta=on metrics • Allocator changes to account for fragmenta=on • New tuning knobs* for write thro_le Since our iniNal foray into this issue, new features have come out: Fragging Rights | Eric Sproul | March 17, 2017 We weren't alone! OpenZFS Improvements * hOp://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/
  • 22. Fragging Rights | Eric Sproul | March 17, 2017 metaslab 2 offset 400000000 spacemap 227 free 7.99G On-disk histogram: fragmentation 90 9: 632678 **************************************** 10: 198275 ************* 11: 342565 ********************** 12: 460625 ****************************** 13: 213397 ************** 14: 82860 ****** 15: 9774 * 16: 137 * 17: 1 * Spacemap Histogram Key is power-of-2 range size FragmentaNon metric is based on this distribuNon
  • 23. • Performance is fine un=l some percentage of metaslabs are "spoiled", even though overall pool used space is low. • Once in this state, only solu=on is bulk data rewrite. • Happens sooner if you have fewer/smaller slabs. • Happens sooner if you increase inges=on rate. Our workload is (someNmes) pathologically bad at scale Fragging Rights | Eric Sproul | March 17, 2017 "Doctor, it hurts when I do <this>..."
 "Then don't do that." What We Learned
  • 24. • Use in-memory DB to accumulate incoming data • Batch-update columnar files with large, sequen=al writes • Eventually replace columnar files with some other DB format Avoid those single-column updates Fragging Rights | Eric Sproul | March 17, 2017 Follow doctor's orders What We're Doing About It
  • 25. QuesFons? Eric Sproul, Circonus ZFS User Conference March 2017 Thanks for listening! @eirescot