Last summer, my team and I faced a question many young startups face. Should we rewrite our system in Rust?
At the time of the decision, we were primarily writing in Go. I was working on an agent that passively watches network traffic, parses API calls, and sends obfuscated summaries back to our service for analysis. As users were starting to run more traffic through us, memory usage by the agent grew to an unacceptably high level, impacting performance.
This led me to spend 25 days in despair and immerse myself in the details of Go’s memory management, our technology stack, and the profiling tools available – trying to get our memory footprint back under control. Go’s fully automatic memory management makes this no easy feat.
Spoiler: I emerged victorious and our team still uses Go. In this talk, I’ll walk through the key steps and lessons learned from this project. I intend it to be helpful for anyone curious about reducing their memory footprint in Go, or wondering about the tradeoffs of switching to or from Go.
Taming Go's Memory Usage — and Avoiding a Rust Rewrite
1. Brought to you by
Taming Go's Memory Usage –
and Avoiding a Rust Rewrite
Mark Gritter
Founding Engineer at Akita Software
2. Mark Gritter
Founding Engineer at Akita Software
■ Previously built VM-aware flash storage arrays at Tintri
■ Trying to build tools that help developers with performance
and correctness!
■ Hobbies: gardening, weaving, math
9. Options
1. Impose a cap on how much memory is used; just restart when we go over.
… can’t do that from within Go. And the system administrators we talked to suggested per-container limits had bad behavior.
2. Rewrite the whole thing in a language that gave us more control over memory management.
… how many months of effort? And no guarantee of success at the end. We didn’t have a nice packet-handling library like gopacket ready to drop in.
3. Find and fix everything we were doing wrong.
… unknown scope of effort.
11. Go’s Garbage Collector
Primary focus: low performance overhead! (and it’s good at that!)
On the other hand: very few knobs to turn; tool ecosystem less mature than Java’s.
■ GOGC = what percentage of live memory can be allocated before the next collection cycle starts.
■ For example, if live data is 200MB, then GOGC=100 (the default) means we can allocate another 200MB before any memory is reclaimed.
■ So with the default setting, RSS can reach at least 2x the live heap size.
[Diagram: last live memory, new allocations, and memory allocated during GC]
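To make the GOGC arithmetic concrete, here is a minimal sketch (the loop and sizes are mine, not from the talk) that keeps growing the live data and prints how the GC trigger point tracks it:

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// GOGC=100 is the default: the next collection starts once the heap
	// has grown by 100% over the live data left by the previous cycle.
	debug.SetGCPercent(100)

	var keep [][]byte // allocations that stay reachable (live data)
	var ms runtime.MemStats

	for i := 0; i < 200; i++ {
		keep = append(keep, make([]byte, 1<<20)) // grow live data by 1 MiB

		runtime.ReadMemStats(&ms)
		fmt.Printf("live ~%d MiB, next GC when heap reaches %d MiB, GCs so far: %d\n",
			ms.HeapAlloc>>20, ms.NextGC>>20, ms.NumGC)
	}
	runtime.KeepAlive(keep)
}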
12. Go’s Built-in Profiling
Go’s built-in pprof support can measure:
■ Size of live heap, at time of measurement (inuse_space, inuse_objects)
■ Allocations made since program start (alloc_space, alloc_objects)
For each of those, a call-stack (of limited depth) is available showing which
function calls led to which memory allocations.
■ This is not always what you want to know: sometimes what you need is which object is keeping those allocations live!
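For reference, exposing these profiles is usually just the standard net/http/pprof import (this is a generic sketch, not the agent’s actual wiring):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
	// Serve profiles on a local port alongside the real work.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... application work ...
	select {}
}

The live-heap and allocation views then come from the same endpoint, e.g. go tool pprof -sample_index=inuse_space (or alloc_space) http://localhost:6060/debug/pprof/heap.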
13. Page cache?
gopacket is generally good about not making excess copies.
But when a packet is missing, we need a place to store data until it is (hopefully) retransmitted.
14. First Fixes
■ Limit total page cache size.
■ Limit per-connection buffering to only a few RTTs’ worth of data.
■ Upgrade to a newer version that releases page-cache entries back to the heap. (This won’t help with spikes, but will ensure they aren’t permanent.)
■ More aggressively expire TCP connections and flush partial data to the parsing layer.
(A sketch of the page-cache limits and flushing follows below.)
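Here is what those limits and the periodic flushing can look like with gopacket’s tcpassembly package; the numbers and helper names are illustrative, not Akita’s actual configuration:

package capture

import (
	"time"

	"github.com/google/gopacket/tcpassembly"
)

// newAssembler builds a TCP reassembler whose page cache is bounded,
// so a burst of out-of-order traffic cannot grow the heap without limit.
func newAssembler(pool *tcpassembly.StreamPool) *tcpassembly.Assembler {
	asm := tcpassembly.NewAssembler(pool)
	asm.MaxBufferedPagesTotal = 4096        // cap on the whole page cache
	asm.MaxBufferedPagesPerConnection = 128 // roughly a few RTTs of data per flow
	return asm
}

// flushLoop periodically expires idle connections so partial data is handed
// to the parsing layer instead of sitting in the reassembly buffers.
func flushLoop(asm *tcpassembly.Assembler, done <-chan struct{}) {
	t := time.NewTicker(30 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			asm.FlushOlderThan(time.Now().Add(-2 * time.Minute))
		case <-done:
			return
		}
	}
}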
15. 1 GiB is Better than 6 GiB Any Day, but Not Good
Heap numbers looked good, but spikes persisted.
17. This is Where it Gets Murky
The allocation profile shows lots of hot spots that are allocating lots of memory. But are they contributing to the spikes in memory use?
A GC trace line (emitted with GODEBUG=gctrace=1):
2021-08-03T22:21:51.946Z,i-049b3fd0dde1cf672,cli,"gc 504 @713.457s 0%:
8.5+79+0.049 ms clock, 34+13/78/78+0.19 ms cpu, 88->102->77 MB, 95 MB goal, 4 P"
Interpretation: 8.5ms stop-the-world, 79ms concurrent mark and scan, 0.049ms “mark termination”. The heap was 88 MB at the start of the cycle, 102 MB at the end of GC, and contained 77 MB of live data.
Conclusion: we allocated 20% of our heap in just the 79ms the GC was running!
18. Progress Without Progress
Nodes drop off the allocation tree, but I can still see the spikes in DataDog.
■ Remove re-initialization of regular expressions (sketched below).
■ Rewrite our visitor to use a pre-allocated stack rather than allocating objects every time it recursed.
● Then fix the subsidiary problems that this allocation was hiding.
● Lazily create slices on demand rather than pre-building them in the visitor context.
This suggests the GC was actually handling these allocations fine! But perhaps this work was necessary to understand the real causes.
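The regular-expression fix is the easiest to show. A minimal sketch (the pattern and function names are made up for illustration): compile once at package scope instead of on every call, so the hot path stops allocating a fresh regexp each time.

package parse

import "regexp"

// Before: compiling inside the hot path allocates a new Regexp
// (and all of its internal structures) on every call.
func pathParamBefore(s string) bool {
	re := regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}`)
	return re.MatchString(s)
}

// After: compile once at package initialization and reuse the same
// Regexp across calls.
var pathParamRe = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}`)

func pathParamAfter(s string) bool {
	return pathParamRe.MatchString(s)
}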
19. Example
flat flat% sum% cum cum%
7562.56MB 27.14% 27.14% 7562.56MB 27.14% stackVisitorContext.appendPath
flat flat% sum% cum cum%
1225.56MB 5.99% 23.87% 2439.59MB 11.93% stackVisitorContext.EnterStruct
892.03MB 4.36% 33.36% 892.03MB 4.36% stackVisitorContext.appendPath
20. Simulating the Tool I Wished I Had
What I really want to know is: which allocations led up to an increase in RSS?
■ I don’t care if I allocate a lot of memory, as long as the GC is good at reclaiming it.
■ I do care if I have to get more memory from the operating system, increasing my footprint.
Solution: grab a heap dump periodically (I used every minute), wait for a spike, and look at the difference in alloc_bytes between the two heap dumps.
[Two heap profiles compared: normal vs. spike]
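A minimal sketch of that workaround (directory layout and interval are mine): write a heap profile once a minute, then compare the snapshot before a spike with the one after it using go tool pprof -base before.pb.gz after.pb.gz and the alloc_space sample index.

package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// dumpHeapProfiles writes a heap profile every minute. Diffing two of them
// with `go tool pprof -base` shows which call stacks allocated the memory
// that appeared between the two snapshots.
func dumpHeapProfiles(dir string) {
	for i := 0; ; i++ {
		time.Sleep(time.Minute)

		f, err := os.Create(fmt.Sprintf("%s/heap-%04d.pb.gz", dir, i))
		if err != nil {
			continue
		}
		runtime.GC() // flush up-to-date allocation statistics into the profile
		pprof.Lookup("heap").WriteTo(f, 0)
		f.Close()
	}
}

func main() {
	go dumpHeapProfiles(os.TempDir())
	// ... agent work ...
	select {}
}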
21. Biting the Bullet
One of our big contributors was objecthash-proto.
■ A library that hashes arbitrary protobufs (which we use for our IR).
■ It makes heavy use of reflection.
■ Reflection on a struct requires extensive memory allocation.
■ (Why? I don’t know, though I could make some guesses.)
Solution: write a code generator that preserves the same behavior but emits hashing functions specific to our protobufs.
BenchmarkWitnessHash-8 15476 76078 ns/op 18349 B/op 947 allocs/op
BenchmarkWitnessOldHash-8 7077 173922 ns/op 48664 B/op 1561 allocs/op
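To show the shape of the change (a hypothetical contrast, not objecthash-proto’s algorithm or Akita’s generated code): a reflection-based hasher walks fields at runtime and allocates for every field it touches, while a generated, type-specific function writes fields directly into the hash.

package hashdemo

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"reflect"
)

// Reflection-based: walks arbitrary struct fields at runtime. Each
// reflect.Value and interface conversion tends to allocate.
func hashReflect(v interface{}) [32]byte {
	h := sha256.New()
	rv := reflect.ValueOf(v) // v must be a struct value in this demo
	for i := 0; i < rv.NumField(); i++ {
		fmt.Fprintf(h, "%v", rv.Field(i).Interface())
	}
	var out [32]byte
	h.Sum(out[:0])
	return out
}

// Generated-style: a function specific to one concrete type, written (or
// emitted by a generator) with no reflection. Field names here are made up.
type Witness struct {
	Method string
	Status int32
}

func (w *Witness) Hash() [32]byte {
	h := sha256.New()
	h.Write([]byte(w.Method))
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], uint32(w.Status))
	h.Write(buf[:])
	var out [32]byte
	h.Sum(out[:0])
	return out
}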
23. One More
Showing nodes accounting for 419.70MB, 87.98% of 477.03MB total
Dropped 129 nodes (cum <= 2.39MB)
Showing top 10 nodes out of 114
flat flat% sum% cum cum%
231.14MB 48.45% 48.45% 234.14MB 49.08% io.ReadAll
52.93MB 11.10% 59.55% 53.43MB 11.20% gopacket...ReadPacketData
51.45MB 10.79% 70.33% 123.88MB 25.97% gopacket...NextPacket
42.42MB 8.89% 79.23% 42.42MB 8.89% bytes.makeSlice
Half the allocations coming from one function?
This turned out to be a buffer between decompression and parsing the HTTP body.
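A sketch of the streaming alternative (the JSON decoding and function names are placeholders, not the agent’s real parser): instead of io.ReadAll buffering the entire decompressed body, hand the decompressing reader straight to the parser.

package body

import (
	"compress/gzip"
	"encoding/json"
	"io"
)

// parseJSONBodyBuffered reads the whole decompressed body into memory first;
// io.ReadAll repeatedly grows its buffer to hold the entire body.
func parseJSONBodyBuffered(r io.Reader) (map[string]interface{}, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	buf, err := io.ReadAll(zr) // buffers the entire body
	if err != nil {
		return nil, err
	}
	var v map[string]interface{}
	return v, json.Unmarshal(buf, &v)
}

// parseJSONBodyStreaming feeds the decompressed stream straight into the
// decoder, so only the decoder's working state is allocated.
func parseJSONBodyStreaming(r io.Reader) (map[string]interface{}, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var v map[string]interface{}
	return v, json.NewDecoder(zr).Decode(&v)
}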
26. < 280MiB
99.9th percentile, on our internal dogfooding
(But, unfortunately, we found some additional problem cases since then.)
27. Lessons
■ Reduce fixed overhead (every live byte in the heap costs two in RSS).
■ Profile allocations, not just live data.
■ Stream, don’t buffer.
■ Replace frequent, small allocations (this is the lesson that leads to the least idiomatic Go code; one approach is sketched below).
■ Avoid generic libraries with unpredictable memory costs.
■ Find a way to simulate the tool you wish you had.
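One common way to replace frequent, small allocations (not necessarily the technique used at Akita) is explicit reuse, for example with sync.Pool; this is exactly the kind of code the lesson above calls less idiomatic:

package pool

import (
	"bytes"
	"strconv"
	"sync"
)

// bufPool hands out reusable byte buffers instead of allocating a fresh one
// for every message, trading a little idiomatic simplicity for less GC churn.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// encodeMessage is a hypothetical hot path that formats many small messages.
func encodeMessage(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	buf.WriteString("len=")
	buf.WriteString(strconv.Itoa(len(payload)))
	buf.WriteByte(' ')
	buf.Write(payload)

	// Copy the result out; the buffer itself goes back into the pool.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}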
28. Brought to you by
Mark Gritter
mgritter@akitasoftware.com
@markgritter