I’ve noticed that exception handling uses a very different amount of memory depending on whether I use the backtrace. For example, if I simply log an exception, my service holds steady at 5-6MB of RAM in production for days, but if I log the backtrace, it uses ~50MB of RAM.
I’ve looked at the Exception::CallStack#decode_backtrace method to find out what could be using so much RAM here, but nothing is jumping out at me. When it comes to reporting errors, getting stack traces is useful, but if it uses an order of magnitude more RAM per process as soon as it hits its first error, it’s almost not worth it.
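For reference, the difference boils down to something like this (a minimal sketch; handle_request is just a placeholder for the real work):

begin
  handle_request # placeholder for whatever might raise
rescue ex
  # Message only: the call stack is captured but never decoded,
  # so memory stays low.
  STDERR.puts ex.message

  # With the backtrace: this goes through Exception::CallStack#decode_backtrace,
  # which is where the extra memory shows up.
  STDERR.puts ex.inspect_with_backtrace
end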
I had a look and I’m not sure that function would be the source of the problem I’m seeing. The exception has a callstack whether or not I inspect it, so I think that unw_step function would already have been called at that point even if I didn’t inspect it, right? Assuming I’m understanding the purpose of that function correctly.
I feel like this may be a separate issue with decoding the stack frames referenced by the elements in this array.
My hunch (without checking) is that it loads all the DWARF info into an internal structure (@@dwarf_line_numbers); maybe it only triggers that when you display the stack trace?
@rogerdpack Nice find; this seems to be exactly what it is. The load_debug_info method is called from several methods invoked inside Exception::CallStack#decode_backtrace to memoize the debug info, so setting CRYSTAL_LOAD_DEBUG_INFO=1 just changes when that method is called and when the data is loaded. With CRYSTAL_LOAD_DEBUG_INFO=1 set, the process was using ~50MB of RAM as soon as it started, and getting an exception made no meaningful difference.
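A quick way to see this for yourself (a Linux-only sketch, since it just reads VmRSS out of /proc/self/status):

# Prints resident memory before and after the first backtrace decode.
# Without CRYSTAL_LOAD_DEBUG_INFO=1 the jump happens at the rescue;
# with it, the first number is already high.
def rss_kb
  line = File.read_lines("/proc/self/status").find { |l| l.starts_with?("VmRSS:") }
  line ? line.split[1].to_i : 0
end

puts "at startup: #{rss_kb} kB"

begin
  raise "boom"
rescue ex
  ex.inspect_with_backtrace
end

puts "after first backtrace decode: #{rss_kb} kB"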
Unfortunately, this means there’s a choice between using an order of magnitude more memory for small services and making exceptions slower any time the app reads backtraces, since the data would have to be reloaded from disk every time. Classic space-vs-time tradeoff.
On the bright side, the monkeypatch to release this memory (on Linux/macOS/BSD, at least) probably looks something like this:
struct Exception::CallStack
  private def decode_backtrace
    previous_def.tap do
      # Clear the memoized DWARF data so the GC can reclaim it.
      # The next backtrace decode will reload it from disk.
      @@dwarf_loaded = false
      @@dwarf_line_numbers = nil
      @@dwarf_function_names = nil
    end
  end
end
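To try it, it can go in any file the app requires; since the prelude is compiled before user code, previous_def here refers to the stdlib implementation, and the next call to decode_backtrace will have to reload the debug info from disk.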
I guess jgaskins is referring to monkey patching it to make it release somehow… which is an interesting stopgap measure until something else can be created, like caching only what is useful?
It is indeed being cached. The cache takes up 10x as much memory as the rest of the program uses. That’s the tradeoff I was mentioning.
The monkeypatch I mentioned was to clear that cache and hopefully release that memory. I don’t know if it’ll be a good idea, though. Specifically, I don’t know whether it’ll release enough memory back to the OS to be worth it, or whether the GC will hold onto that memory, since those heap pages may not be empty anymore and Boehm doesn’t compact. It may also be prohibitively slow if an app sees a lot of exceptions that have their call stacks inspected, for example when reporting them to an error tracker like Honeybadger.
Ideally, apps should see very few of them (they’re supposed to be exceptional), and a slow disk read is probably fine when they’re few and far between, but you also don’t want a buggy deploy to wreck performance. Hard to know. I’ll run some benchmarks on it.
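The benchmark I have in mind is roughly this (a sketch only; it assumes the monkeypatch above is applied, so every decode re-reads the debug info from disk):

require "benchmark"

# Each iteration raises, rescues, and decodes the backtrace.
# With the monkeypatch applied this measures the worst case,
# where the DWARF data is reloaded on every decode.
def raise_and_decode
  raise "boom"
rescue ex
  ex.inspect_with_backtrace
end

Benchmark.ips do |x|
  x.report("raise + decode backtrace") { raise_and_decode }
end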
Another dimension for trade-offs is reducing the amount of debug information in the binary. If you have less debug info, there’s less to load. But as a result, you have less debug info for debugging.
Then I do wonder if all of this is done optimally. Is there any other language that loads this info at runtime and stores it in memory? It might be the case that the data is arranged in the file in a way that makes access to it fast (for example, if you need to compute a backtrace given some bits you got from unwinding), so there might never be a need to store the entire thing in memory.
For comparison, I tried the equivalent in Go: an HTTP server where the root path does nothing so I can get a baseline for memory overhead, and the /oops path panics. Both hit a plateau of just under 12MB of RAM after 10k requests, and the panic does show the source location of each stack frame:
This seems to indicate that there is a more efficient way of looking up this info without holding it all in RAM. Looking into the Go panic code to see how it gets this info.