Memory profiling

To optimize a program’s memory usage, which might also impact its speed (allocating memory is slow compared to other things) I had this idea: what if there’s a callback of some sort any time an object is created? Then we can build code around that callback to gather stats.

Here’s a diff to the compiler’s source code that enables that:

--- a/src/compiler/crystal/semantic/new.cr
+++ b/src/compiler/crystal/semantic/new.cr
@@ -215,6 +215,9 @@ module Crystal
       exps << init
       exps << If.new(RespondsTo.new(obj.clone, "finalize").at(self),
         Call.new(Path.global("GC").at(self), "add_finalizer", obj.clone).at(self))
+      {% if env("TRACE_NEW") == "1" %}
+        exps << Call.new(nil, "__trace_new", args: [Var.new("self").at(self)] of ASTNode, global
+      {% end %}
       exps << obj

       # Forward block argument if any
@@ -244,6 +247,9 @@ module Crystal
       exps = Array(ASTNode).new(3)
       exps << assign
       exps << If.new(RespondsTo.new(var.clone, "finalize").at(loc), call).at(loc)
+      {% if env("TRACE_NEW") == "1" %}
+        exps << Call.new(nil, "__trace_new", args: [Var.new("self").at(loc)] of ASTNode, global:
+      {% end %}
       exps << var.clone

       a_def = Def.new("new", body: exps).at(loc)

That’s the simplest thing I came up with, but it could be a compiler flag or something.

With that, every initialize method that expands into a new class method will have a call to a global __trace_new method (this could be done differently.)

Of course that __trace_new method doesn’t exist, but we can create it!

Let’s try it…

def __trace_new(type : T.class) forall T
  LibC.printf("%s\n", T.name)
end

puts "Hello world!"

Now if you run that program with a compiler build with the above patch (and compiled with TRACE_NEW=1) you’ll see this output:

Array(Pointer(Bool))
Crystal::OnceState
Array(String)
Path
Iterator::Stop
Deque(Pointer(Void))
Thread::Mutex
Fiber::StackPool
Slice(T)
Slice(T)
Slice(UInt8)
(... and a lot more)

Those are all the objects being created!

But that’s not very useful as an output.

We can put the names into a hash and count them!

module Tracing
  class_property? running = false
  class_property(stats : Hash(String, Int32)) do
    Hash(String, Int32).new(0)
  end

  at_exit do
    Tracing.stats.to_a.sort_by(&.[1]).each do |type, count|
      puts "# #{type} #{count}"
    end
  end
end

def __trace_new(type : T.class) forall T
  return if Tracing.running?

  Tracing.running = true
  Tracing.stats[T.name] += 1
  Tracing.running = false
end

Now we get this output (I’ll show just the last bits):

Hash(Thread, Deque(Fiber)) 8
Crystal::ThreadLocalValue(Deque(Fiber)) 8
Hash(Thread, Crystal::Event) 8
Crystal::ThreadLocalValue(Crystal::Event) 8
Crystal::SpinLock 19

I had no idea the empty Crystal program created 19 spin locks!

Well, tracing memory allocation of structs doesn’t make sense, so we can stick a {% if T.class? %} there. Here’s the result:

IO::FileDescriptor 4
Hash(Thread, Deque(Fiber)) 8
Hash(Thread, Crystal::Event) 8
Crystal::SpinLock 19

(notice the ThreadLocalValue is gone, because it’s a struct)

That’s a bit more useful, but where are all those spinlocks being created?

So I came up with a tracing that has two modes. In the default mode it will show the above output. Then you can specify a target, in this case Crystal::SpinLock and it will tell you the most frequent callstacks where this happens.

Here’s the code for that:

module Tracing
  class_property(target : String) do
    ENV["TRACE_TARGET"]? || ""
  end

  BACKTRACE_LIMIT = ENV["TRACE_BACKTRACE_LIMIT"]?.try(&.to_i) || 6
  MIN_COUNT       = ENV["TRACE_MIN_COUNT"]?.try(&.to_i)

  class_property? running = false
  class_property(stats : Hash(String, Int32)) do
    Hash(String, Int32).new(0)
  end
  class_property(callstacks : Array(Exception::CallStack)) do
    Array(Exception::CallStack).new
  end

  at_exit do
    Tracing.running = true

    if Tracing.target.empty?
      Tracing.show_stats
    else
      Tracing.show_callstacks
    end
  end

  def self.show_callstacks
    puts "Done!"
    puts "All callstacks: #{Tracing.callstacks.size}"
    puts "Gathering callstack..."

    callstack_to_backtrace = Hash(Exception::CallStack, Array(String)).new do |h, k|
      h[k] = k.printable_backtrace(Tracing::BACKTRACE_LIMIT)
    end

    backtraces = Tracing.callstacks.map_with_index do |callstack, i|
      LibC.printf("%d\n", i) if i.divisible_by?(10_000)
      callstack_to_backtrace[callstack]
    end

    sorted_backtraces =
      backtraces
        .tally
        .to_a
        .sort_by { |backtrace, count| -count }

    min_count = Tracing::MIN_COUNT
    if min_count
      sorted_backtraces.reject! do |backtrace, count|
        count < min_count
      end
    end

    sorted_backtraces
      .each do |backtrace, count|
        puts "#{count} times:"
        puts "=" * 40
        backtrace.each do |frame|
          puts frame
        end
      end
  end

  def self.show_stats
    Tracing.stats.to_a.sort_by(&.[1]).each do |k, v|
      puts "#{k}: #{v}"
    end
  end
end

def __trace_new(type : T.class) forall T
  return if Tracing.running?

  Tracing.running = true

  target = Tracing.target
  if target.empty?
    {% if T.class? %}
      Tracing.stats[T.name] += 1
    {% end %}
  else
    if T.name == target
      Tracing.callstacks << Exception::CallStack.new
    end
  end

  Tracing.running = false
end

Now we can compile that program. Let’s call it bar.cr:

crystal build bar.cr

Now if we run it we get the same output as before.

But then we can run it like this:

TRACE_TARGET=Crystal::SpinLock ./bar

And we get this output:

Done!
All callstacks: 23
Gathering callstack...
0
2 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/thread_local_value.cr:4:12 in 'new'
src/io/evented.cr:12:14 in 'new'
src/crystal/system/unix/file_descriptor.cr:187:5 in 'from_stdio'
src/io/file_descriptor.cr:35:5 in 'from_stdio'
2 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/thread_local_value.cr:4:12 in 'new'
src/io/evented.cr:13:14 in 'new'
src/crystal/system/unix/file_descriptor.cr:187:5 in 'from_stdio'
src/io/file_descriptor.cr:35:5 in 'from_stdio'
2 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/thread_local_value.cr:4:12 in 'new'
src/io/evented.cr:15:17 in 'new'
src/crystal/system/unix/file_descriptor.cr:187:5 in 'from_stdio'
src/io/file_descriptor.cr:35:5 in 'from_stdio'
(...and a lot more...)

You can control how many callstack frames to take into account. The default is 6 but maybe that’s too much. Let’s change it to 3.

$ TRACE_TARGET=Crystal::SpinLock TRACE_BACKTRACE_LIMIT=3 ./bar
Done!
All callstacks: 23
Gathering callstack...
0
20 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/thread_local_value.cr:4:12 in 'new'
2 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/mutex.cr:24:11 in 'new'
1 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/scheduler.cr:73:11 in 'new'

Now we can quickly see that the most usual allocation of it happens in ThreadLocalValue.

The nicest thing about this is that how memory is tracked and profiled is entirely up to the user. Maybe the standard library can provide a default one, like the above. But it’s completely customizable, and entirely in Crystal.

I’m actually starting to use that to profile the compiler, to reduce memory allocations (it seems we are allocating a lot of arrays of ASTNode!)

Do you think this can be helpful to have? Did you wanted something like this in the past?

20 Likes

This is amazing! Many times I needed to try and use perf and callgrind but there is no good trace to show where it was called. with this it seems like a very easy way to find and reduce overly generated objects! this is very very cool!

1 Like

I am a bit hesitant about making this a regular call. For this particular kind of profiling we should only ever need to call __trace_new from the allocate primitive.

It’s a great observation! I actually initially wanted to do it for the allocate primitive. The problem is that the primitive would have to somehow do the call, typechecking it. So maybe the allocate method that’s on every object should call the primtiive and something else before that. But at that point I realized, for the proof of concept, I only needed to patch the new method.

That said, maybe new is a good place, because this is called once the object is fully complete. Maybe you’d like to do some stats on specific objects, no idea.

My idea is we could simply inject a fun call __trace_new(T.crystal_instance_type_id) in the codegen phase which avoids much of the type inference, since the compiler code generating the tracing calls should itself be fast. For a tracer to be useful we also need similar funs to to go from a type ID to a C string of the corresponding type, or vice versa. (This idea came up when I was working on the GDB / LLDB pretty-printers, since a union variable also needs this capability to get the correct run-time type.)

3 Likes

This sounds very useful.

Do you think this can be helpful to have?

Yes, at least in some form.

Did you wanted something like this in the past?

Yes. I’ve been playing around with valgrind --tool=dhat which provides something very similar (looking at the output is highly recommended and may give ideas for improvement) . Unfortunately the valgrind tools doesn’t recognize the allocating functions crystal uses, so both the dhat and massifonly see what linked libraries make use of. So when using dhat on the compiler then it only shows a couple of code generation places and a couple of thousand invokations to getcwd (which call malloc down in libc. Being able to find such places in crystal code would be great.

Hmm, i wonder what valgrind actually does and if it is extensible. Or how hard it is to output the same format valgrind does - having a prebuilt tool to navigate tries of allocating stack traces would be pretty nice.

1 Like

Some profilers like perf write out to a file and then process the data later. Is that to keep from messing with original code the profiler is measuring? If so I wonder if that is a pattern we would want in crystal.

2 Likes

Thanks very much, I tried to use massif with a crystal program weeks ago, it was just showing allocations made by libyaml and I had no idea why.

This proposal just add the call to __trace_new on auto-generated new method? Or does it also do that for user written self.new methods?

I think that a call __trace_new (__trace_alloc?) on allocate seems to cover these use cases where the user writes it own self.new to re-use some cache and avoid a call to allocate.

I’d be very happy if someone figured out a way to make valgrind see the allocations.

1 Like

The way to do this is to figure out a way to let Boehm GC use malloc instead of sbrk to expand the heap… I just found an option to let it use mmap, that also doesn’t work with valgrind.

1 Like

Unrelated but would it be possible to do something like this for “all function calls”? Sometimes I yearn for ruby’s -rtracer equivalent… :)

1 Like