Last summer, my team and I faced a question many young startups face. Should we rewrite our system in Rust?
At the time of the decision, we were primarily writing in Go. I was working on an agent that passively watches network traffic, parses API calls, and sends obfuscated summaries back to our service for analysis. As users were starting to run more traffic through us, memory usage by the agent grew to an unacceptably high level, impacting performance.
This led me to spend 25 days in despair and immerse myself in the details of Go’s memory management, our technology stack, and the profiling tools available – trying to get our memory footprint back under control. Go’s fully automatic memory management makes this no easy feat.
Spoiler: I emerged victorious and our team still uses Go. In this talk, I’ll walk through the key steps and lessons learned from this project. I intend it to be helpful for anyone curious about reducing their memory footprint in Go, or wondering about the tradeoffs of switching to or from Go.
Taming Go's Memory Usage — and Avoiding a Rust Rewrite
1. Brought to you by
Taming Go's Memory Usage –
and Avoiding a Rust Rewrite
Mark Gritter
Founding Engineer at Akita Software
2. Mark Gritter
Founding Engineer at Akita Software
■ Previously built VM-aware flash storage arrays at Tintri
■ Trying to build tools that help developers with performance
and correctness!
■ Hobbies: gardening, weaving, math
9. Options
1. Impose a cap on how much memory is used; just restart when we go over.
… can’t do that from within Go. And the system administrators we talked to suggested per-container limits had bad behavior.
2. Rewrite the whole thing in a language that gave us more control over memory management.
… how many months of effort? And no guarantee of success at the end. We didn’t have a nice packet-handling library like gopacket ready to drop in.
3. Find and fix everything we were doing wrong.
… unknown scope of effort.
11. Go’s Garbage Collector
Primary focus: low performance overhead! (and it’s good at that!)
On the other hand: very few knobs to turn; tool ecosystem less mature than Java’s.
■ GOGC = what percentage of live memory can be allocated before the next collection cycle starts.
■ For example, if live data is 200MB, then GOGC=100 (the default) means we can allocate another 200MB before any memory is reclaimed.
■ So with the default setting, RSS can reach at least 2x the live heap size.
[Diagram: last live memory, new allocations, and memory allocated during GC]
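To make the GOGC arithmetic concrete, here is a minimal sketch (the loop and sizes are mine, not from the talk) that keeps growing the live data and prints how the GC trigger point tracks it:

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// GOGC=100 is the default: the next collection starts once the heap
	// has grown by 100% over the live data left by the previous cycle.
	debug.SetGCPercent(100)

	var keep [][]byte // allocations that stay reachable (live data)
	var ms runtime.MemStats

	for i := 0; i < 200; i++ {
		keep = append(keep, make([]byte, 1<<20)) // grow live data by 1 MiB

		runtime.ReadMemStats(&ms)
		fmt.Printf("live ~%d MiB, next GC when heap reaches %d MiB, GCs so far: %d\n",
			ms.HeapAlloc>>20, ms.NextGC>>20, ms.NumGC)
	}
	runtime.KeepAlive(keep)
}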
12. Go’s Built-in Profiling
Go’s built-in pprof support can measure:
■ Size of live heap, at time of measurement (inuse_space, inuse_objects)
■ Allocations made since program start (alloc_space, alloc_objects)
For each of those, a call-stack (of limited depth) is available showing which
function calls led to which memory allocations.
■ This is not always what you want to know: sometimes what you need is which object is keeping those allocations live!
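For reference, exposing these profiles is usually just the standard net/http/pprof import (this is a generic sketch, not the agent’s actual wiring):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
	// Serve profiles on a local port alongside the real work.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... application work ...
	select {}
}

The live-heap and allocation views then come from the same endpoint, e.g. go tool pprof -sample_index=inuse_space (or alloc_space) http://localhost:6060/debug/pprof/heap.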
13. Page cache?
gopacket is generally good about not making excess copies.
But when a packet is missing, we need a place to store data until it is (hopefully) retransmitted.
14. First Fixes
■ Limit total page cache size.
■ Limit per-connection buffering to only a few RTTs’ worth of data.
■ Upgrade to a newer version that releases page-cache entries back to the heap. (This won’t help with spikes, but will ensure they aren’t permanent.)
■ More aggressively expire TCP connections and flush partial data to the parsing layer.
(A sketch of the page-cache limits and flushing follows below.)
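Here is what those limits and the periodic flushing can look like with gopacket’s tcpassembly package; the numbers and helper names are illustrative, not Akita’s actual configuration:

package capture

import (
	"time"

	"github.com/google/gopacket/tcpassembly"
)

// newAssembler builds a TCP reassembler whose page cache is bounded,
// so a burst of out-of-order traffic cannot grow the heap without limit.
func newAssembler(pool *tcpassembly.StreamPool) *tcpassembly.Assembler {
	asm := tcpassembly.NewAssembler(pool)
	asm.MaxBufferedPagesTotal = 4096        // cap on the whole page cache
	asm.MaxBufferedPagesPerConnection = 128 // roughly a few RTTs of data per flow
	return asm
}

// flushLoop periodically expires idle connections so partial data is handed
// to the parsing layer instead of sitting in the reassembly buffers.
func flushLoop(asm *tcpassembly.Assembler, done <-chan struct{}) {
	t := time.NewTicker(30 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			asm.FlushOlderThan(time.Now().Add(-2 * time.Minute))
		case <-done:
			return
		}
	}
}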
15. 1 GiB is Better than 6 GiB Any Day, but Not Good
Heap numbers looked good, but spikes persisted.
17. This is Where it Gets Murky
The allocation profile shows lots of hot spots that are allocating lots of memory. But are they contributing to the spikes in memory use?
A GC trace line (emitted with GODEBUG=gctrace=1):
2021-08-03T22:21:51.946Z,i-049b3fd0dde1cf672,cli,"gc 504 @713.457s 0%:
8.5+79+0.049 ms clock, 34+13/78/78+0.19 ms cpu, 88->102->77 MB, 95 MB goal, 4 P"
Interpretation: 8.5ms stop-the-world, 79ms concurrent mark and scan, 0.049ms “mark termination”. The heap was 88 MB at the start of the cycle, 102 MB at the end of GC, and contained 77 MB of live data.
Conclusion: we allocated 20% of our heap in just the 79ms the GC was running!
18. Progress Without Progress
Nodes drop off the allocation tree, but I can still see the spikes in DataDog.
■ Remove re-initialization of regular expressions (sketched below).
■ Rewrite our visitor to use a pre-allocated stack rather than allocating objects every time it recursed.
● Then fix the subsidiary problems that this allocation was hiding.
● Lazily create slices on demand rather than pre-building them in the visitor context.
This suggests the GC was actually handling these allocations fine! But perhaps this work was necessary to understand the real causes.
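The regular-expression fix is the easiest to show. A minimal sketch (the pattern and function names are made up for illustration): compile once at package scope instead of on every call, so the hot path stops allocating a fresh regexp each time.

package parse

import "regexp"

// Before: compiling inside the hot path allocates a new Regexp
// (and all of its internal structures) on every call.
func pathParamBefore(s string) bool {
	re := regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}`)
	return re.MatchString(s)
}

// After: compile once at package initialization and reuse the same
// Regexp across calls.
var pathParamRe = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}`)

func pathParamAfter(s string) bool {
	return pathParamRe.MatchString(s)
}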
19. Example
flat flat% sum% cum cum%
7562.56MB 27.14% 27.14% 7562.56MB 27.14% stackVisitorContext.appendPath
flat flat% sum% cum cum%
1225.56MB 5.99% 23.87% 2439.59MB 11.93% stackVisitorContext.EnterStruct
892.03MB 4.36% 33.36% 892.03MB 4.36% stackVisitorContext.appendPath
20. Simulating the Tool I Wished I Had
What I really want to know is: which allocations led up to an increase in RSS?
■ I don’t care if I allocate a lot of memory, as long as the GC is good at reclaiming it.
■ I do care if I have to get more memory from the operating system, increasing my footprint.
Solution: grab a heap dump periodically (I used every minute), wait for a spike, and look at the difference in alloc_bytes between the two heap dumps.
[Two heap profiles compared: normal vs. spike]
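A minimal sketch of that workaround (directory layout and interval are mine): write a heap profile once a minute, then compare the snapshot before a spike with the one after it using go tool pprof -base before.pb.gz after.pb.gz and the alloc_space sample index.

package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// dumpHeapProfiles writes a heap profile every minute. Diffing two of them
// with `go tool pprof -base` shows which call stacks allocated the memory
// that appeared between the two snapshots.
func dumpHeapProfiles(dir string) {
	for i := 0; ; i++ {
		time.Sleep(time.Minute)

		f, err := os.Create(fmt.Sprintf("%s/heap-%04d.pb.gz", dir, i))
		if err != nil {
			continue
		}
		runtime.GC() // flush up-to-date allocation statistics into the profile
		pprof.Lookup("heap").WriteTo(f, 0)
		f.Close()
	}
}

func main() {
	go dumpHeapProfiles(os.TempDir())
	// ... agent work ...
	select {}
}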
21. Biting the Bullet
One of our big contributors was objecthash-proto.
■ A library that hashes arbitrary protobufs (which we use for our IR).
■ It makes heavy use of reflection.
■ Reflection on a struct requires extensive memory allocation.
■ (Why? I don’t know, though I could make some guesses.)
Solution: write a code generator that preserves the same behavior but emits hashing functions specific to our protobufs.
BenchmarkWitnessHash-8 15476 76078 ns/op 18349 B/op 947 allocs/op
BenchmarkWitnessOldHash-8 7077 173922 ns/op 48664 B/op 1561 allocs/op
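To show the shape of the change (a hypothetical contrast, not objecthash-proto’s algorithm or Akita’s generated code): a reflection-based hasher walks fields at runtime and allocates for every field it touches, while a generated, type-specific function writes fields directly into the hash.

package hashdemo

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"reflect"
)

// Reflection-based: walks arbitrary struct fields at runtime. Each
// reflect.Value and interface conversion tends to allocate.
func hashReflect(v interface{}) [32]byte {
	h := sha256.New()
	rv := reflect.ValueOf(v) // v must be a struct value in this demo
	for i := 0; i < rv.NumField(); i++ {
		fmt.Fprintf(h, "%v", rv.Field(i).Interface())
	}
	var out [32]byte
	h.Sum(out[:0])
	return out
}

// Generated-style: a function specific to one concrete type, written (or
// emitted by a generator) with no reflection. Field names here are made up.
type Witness struct {
	Method string
	Status int32
}

func (w *Witness) Hash() [32]byte {
	h := sha256.New()
	h.Write([]byte(w.Method))
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], uint32(w.Status))
	h.Write(buf[:])
	var out [32]byte
	h.Sum(out[:0])
	return out
}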
23. One More
Showing nodes accounting for 419.70MB, 87.98% of 477.03MB total
Dropped 129 nodes (cum <= 2.39MB)
Showing top 10 nodes out of 114
flat flat% sum% cum cum%
231.14MB 48.45% 48.45% 234.14MB 49.08% io.ReadAll
52.93MB 11.10% 59.55% 53.43MB 11.20% gopacket...ReadPacketData
51.45MB 10.79% 70.33% 123.88MB 25.97% gopacket...NextPacket
42.42MB 8.89% 79.23% 42.42MB 8.89% bytes.makeSlice
Half the allocations coming from one function?
This turned out to be a buffer between decompression and parsing the HTTP body.
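A sketch of the streaming alternative (the JSON decoding and function names are placeholders, not the agent’s real parser): instead of io.ReadAll buffering the entire decompressed body, hand the decompressing reader straight to the parser.

package body

import (
	"compress/gzip"
	"encoding/json"
	"io"
)

// parseJSONBodyBuffered reads the whole decompressed body into memory first;
// io.ReadAll repeatedly grows its buffer to hold the entire body.
func parseJSONBodyBuffered(r io.Reader) (map[string]interface{}, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	buf, err := io.ReadAll(zr) // buffers the entire body
	if err != nil {
		return nil, err
	}
	var v map[string]interface{}
	return v, json.Unmarshal(buf, &v)
}

// parseJSONBodyStreaming feeds the decompressed stream straight into the
// decoder, so only the decoder's working state is allocated.
func parseJSONBodyStreaming(r io.Reader) (map[string]interface{}, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var v map[string]interface{}
	return v, json.NewDecoder(zr).Decode(&v)
}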
26. < 280MiB
99.9th percentile, on our internal dogfooding
(But, unfortunately, we found some additional problem cases since then.)
27. Lessons
■ Reduce fixed overhead (every live byte in the heap costs two in RSS).
■ Profile allocations, not just live data.
■ Stream, don’t buffer.
■ Replace frequent, small allocations (this is the lesson that leads to the least idiomatic Go code; one approach is sketched below).
■ Avoid generic libraries with unpredictable memory costs.
■ Find a way to simulate the tool you wish you had.
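One common way to replace frequent, small allocations (not necessarily the technique used at Akita) is explicit reuse, for example with sync.Pool; this is exactly the kind of code the lesson above calls less idiomatic:

package pool

import (
	"bytes"
	"strconv"
	"sync"
)

// bufPool hands out reusable byte buffers instead of allocating a fresh one
// for every message, trading a little idiomatic simplicity for less GC churn.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// encodeMessage is a hypothetical hot path that formats many small messages.
func encodeMessage(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	buf.WriteString("len=")
	buf.WriteString(strconv.Itoa(len(payload)))
	buf.WriteByte(' ')
	buf.Write(payload)

	// Copy the result out; the buffer itself goes back into the pool.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}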
28. Brought to you by
Mark Gritter
mgritter@akitasoftware.com
@markgritter