To optimize a program’s memory usage, which might also impact its speed (allocating memory is slow compared to other things), I had this idea: what if there’s a callback of some sort any time an object is created? Then we could build code around that callback to gather stats.
Here’s a diff to the compiler’s source code that enables that:
--- a/src/compiler/crystal/semantic/new.cr
+++ b/src/compiler/crystal/semantic/new.cr
@@ -215,6 +215,9 @@ module Crystal
         exps << init
         exps << If.new(RespondsTo.new(obj.clone, "finalize").at(self),
           Call.new(Path.global("GC").at(self), "add_finalizer", obj.clone).at(self))
+        {% if env("TRACE_NEW") == "1" %}
+          exps << Call.new(nil, "__trace_new", args: [Var.new("self").at(self)] of ASTNode, global: true).at(self)
+        {% end %}
         exps << obj
         # Forward block argument if any
@@ -244,6 +247,9 @@ module Crystal
       exps = Array(ASTNode).new(3)
       exps << assign
       exps << If.new(RespondsTo.new(var.clone, "finalize").at(loc), call).at(loc)
+      {% if env("TRACE_NEW") == "1" %}
+        exps << Call.new(nil, "__trace_new", args: [Var.new("self").at(loc)] of ASTNode, global: true).at(loc)
+      {% end %}
       exps << var.clone
       a_def = Def.new("new", body: exps).at(loc)
That’s the simplest thing I came up with, but it could be a compiler flag or something.
With that, every initialize method that expands into a new class method will have a call to a global __trace_new method (this could be done differently). Of course that __trace_new method doesn’t exist yet, but we can create it!
Let’s try it…
def __trace_new(type : T.class) forall T
  LibC.printf("%s\n", T.name)
end

puts "Hello world!"
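To make the injected call concrete, here’s a rough sketch of what the patched compiler effectively generates for some class Foo. This is hypothetical source form for illustration only; as the diff shows, the compiler builds this as AST nodes, not as text:

```crystal
# Hypothetical sketch of the expanded `new` for some class Foo
# (illustrative only; the real expansion is built from AST nodes):
class Foo
  def self.new
    obj = allocate                                       # GC-allocated instance
    obj.initialize                                       # run the user's initialize
    GC.add_finalizer(obj) if obj.responds_to?(:finalize) # only if a finalizer exists
    __trace_new(self)                                    # the call injected by the patch
    obj
  end
end
```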
Now if you run that program with a compiler built with the above patch (and compiled with TRACE_NEW=1), you’ll see this output:
Array(Pointer(Bool))
Crystal::OnceState
Array(String)
Path
Iterator::Stop
Deque(Pointer(Void))
Thread::Mutex
Fiber::StackPool
Slice(T)
Slice(T)
Slice(UInt8)
(... and a lot more)
Those are all the objects being created!
But that’s not very useful as an output.
We can put the names into a hash and count them!
module Tracing
  class_property? running = false

  class_property(stats : Hash(String, Int32)) do
    Hash(String, Int32).new(0)
  end

  at_exit do
    Tracing.stats.to_a.sort_by(&.[1]).each do |type, count|
      puts "# #{type} #{count}"
    end
  end
end
def __trace_new(type : T.class) forall T
  return if Tracing.running?
  Tracing.running = true
  Tracing.stats[T.name] += 1
  Tracing.running = false
end
Note the running? flag: it guards against re-entrancy, because the tracing code allocates too (the stats hash itself is created with new), which would otherwise trigger __trace_new recursively. Now we get this output (I’ll show just the last bits):
Hash(Thread, Deque(Fiber)) 8
Crystal::ThreadLocalValue(Deque(Fiber)) 8
Hash(Thread, Crystal::Event) 8
Crystal::ThreadLocalValue(Crystal::Event) 8
Crystal::SpinLock 19
I had no idea the empty Crystal program created 19 spin locks!
Well, structs aren’t heap-allocated, so tracing their allocations doesn’t make sense; we can stick a {% if T.class? %} check around the counting. Here’s the result:
IO::FileDescriptor 4
Hash(Thread, Deque(Fiber)) 8
Hash(Thread, Crystal::Event) 8
Crystal::SpinLock 19
(notice the ThreadLocalValue is gone, because it’s a struct)
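For reference, the struct-skipping version of the hook looks like this. It’s the same code as before, with the counting wrapped in the macro check (T.class? is true only for reference types):

```crystal
def __trace_new(type : T.class) forall T
  return if Tracing.running?
  Tracing.running = true
  {% if T.class? %}
    # Only count reference types; struct instances don't hit the GC heap
    Tracing.stats[T.name] += 1
  {% end %}
  Tracing.running = false
end
```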
That’s a bit more useful, but where are all those spinlocks being created?
So I came up with a tracer that has two modes. In the default mode it shows the output above. Then you can specify a target, in this case Crystal::SpinLock, and it will tell you the most frequent call stacks where that allocation happens.
Here’s the code for that:
module Tracing
  class_property(target : String) do
    ENV["TRACE_TARGET"]? || ""
  end

  BACKTRACE_LIMIT = ENV["TRACE_BACKTRACE_LIMIT"]?.try(&.to_i) || 6
  MIN_COUNT       = ENV["TRACE_MIN_COUNT"]?.try(&.to_i)

  class_property? running = false

  class_property(stats : Hash(String, Int32)) do
    Hash(String, Int32).new(0)
  end

  class_property(callstacks : Array(Exception::CallStack)) do
    Array(Exception::CallStack).new
  end

  at_exit do
    Tracing.running = true
    if Tracing.target.empty?
      Tracing.show_stats
    else
      Tracing.show_callstacks
    end
  end

  def self.show_callstacks
    puts "Done!"
    puts "All callstacks: #{Tracing.callstacks.size}"
    puts "Gathering callstack..."

    callstack_to_backtrace = Hash(Exception::CallStack, Array(String)).new do |h, k|
      h[k] = k.printable_backtrace(Tracing::BACKTRACE_LIMIT)
    end

    backtraces = Tracing.callstacks.map_with_index do |callstack, i|
      LibC.printf("%d\n", i) if i.divisible_by?(10_000)
      callstack_to_backtrace[callstack]
    end

    sorted_backtraces =
      backtraces
        .tally
        .to_a
        .sort_by { |backtrace, count| -count }

    min_count = Tracing::MIN_COUNT
    if min_count
      sorted_backtraces.reject! do |backtrace, count|
        count < min_count
      end
    end

    sorted_backtraces.each do |backtrace, count|
      puts "#{count} times:"
      puts "=" * 40
      backtrace.each do |frame|
        puts frame
      end
    end
  end

  def self.show_stats
    Tracing.stats.to_a.sort_by(&.[1]).each do |k, v|
      puts "#{k}: #{v}"
    end
  end
end

def __trace_new(type : T.class) forall T
  return if Tracing.running?
  Tracing.running = true

  target = Tracing.target
  if target.empty?
    {% if T.class? %}
      Tracing.stats[T.name] += 1
    {% end %}
  else
    if T.name == target
      Tracing.callstacks << Exception::CallStack.new
    end
  end

  Tracing.running = false
end
Now we can compile that program. Let’s call it bar.cr:
crystal build bar.cr
Now if we run it we get the same output as before.
But then we can run it like this:
TRACE_TARGET=Crystal::SpinLock ./bar
And we get this output:
Done!
All callstacks: 23
Gathering callstack...
0
2 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/thread_local_value.cr:4:12 in 'new'
src/io/evented.cr:12:14 in 'new'
src/crystal/system/unix/file_descriptor.cr:187:5 in 'from_stdio'
src/io/file_descriptor.cr:35:5 in 'from_stdio'
2 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/thread_local_value.cr:4:12 in 'new'
src/io/evented.cr:13:14 in 'new'
src/crystal/system/unix/file_descriptor.cr:187:5 in 'from_stdio'
src/io/file_descriptor.cr:35:5 in 'from_stdio'
2 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/thread_local_value.cr:4:12 in 'new'
src/io/evented.cr:15:17 in 'new'
src/crystal/system/unix/file_descriptor.cr:187:5 in 'from_stdio'
src/io/file_descriptor.cr:35:5 in 'from_stdio'
(...and a lot more...)
You can control how many callstack frames to take into account. The default is 6 but maybe that’s too much. Let’s change it to 3.
$ TRACE_TARGET=Crystal::SpinLock TRACE_BACKTRACE_LIMIT=3 ./bar
Done!
All callstacks: 23
Gathering callstack...
0
20 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/thread_local_value.cr:4:12 in 'new'
2 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/mutex.cr:24:11 in 'new'
1 times:
========================================
bar.cr:83:29 in '__trace_new'
src/primitives.cr:36:1 in 'new'
src/crystal/scheduler.cr:73:11 in 'new'
Now we can quickly see that most of these allocations happen in ThreadLocalValue.
The nicest thing about this is that how memory is tracked and profiled is entirely up to the user. Maybe the standard library could provide a default implementation, like the one above. But it’s completely customizable, and written entirely in Crystal.
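For example, a user could swap in a hook that only counts types whose name matches a prefix. This is a hypothetical sketch: the MyTracer module and the ALLOC_PREFIX environment variable are made up for illustration, and it still requires a compiler built with the patch above:

```crystal
# Hypothetical custom hook: count only types whose name starts with a
# runtime-chosen prefix. Requires the TRACE_NEW-patched compiler.
module MyTracer
  PREFIX = ENV["ALLOC_PREFIX"]? || ""

  class_property? running = false
  class_getter counts = Hash(String, Int32).new(0)

  at_exit do
    MyTracer.counts.each { |name, n| puts "#{name}: #{n}" }
  end
end

def __trace_new(type : T.class) forall T
  return if MyTracer.running?
  MyTracer.running = true
  MyTracer.counts[T.name] += 1 if T.name.starts_with?(MyTracer::PREFIX)
  MyTracer.running = false
end
```

Running the program as ALLOC_PREFIX=Hash ./bar would then report only the Hash allocations.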
I’m actually starting to use that to profile the compiler, to reduce memory allocations (it seems we are allocating a lot of arrays of ASTNode!)
Do you think this could be helpful to have? Did you want something like this in the past?