Instrumentation patterns in Crystal

For a while I’ve been working on a couple of shards for instrumenting Crystal apps: Datadog and also OpenTelemetry for vendor-agnostic instrumentation. With both shards, writing integrations is quite simple — you basically monkey patch the methods and wrap them with TheShardName.trace { previous_def } while setting various span attributes.
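The pattern looks roughly like this (a minimal sketch — `OrderService`, `place`, and the `TheShardName.trace`/`set_attribute` API shape are all hypothetical stand-ins, not the real shard API):

```crystal
# Hypothetical integration: reopen a type that is already defined
# elsewhere and redefine one of its methods. `previous_def` calls the
# original implementation, so the span brackets the real work.
class OrderService
  def place(order_id : Int64) : Nil
    TheShardName.trace("order.place") do |span|
      span.set_attribute("order.id", order_id) # illustrative attribute
      previous_def
    end
  end
end
```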

The problem is that this leads to pretty frustrating compilation times, at least on macOS (not on Linux for some reason), to the point where I have to disable instrumentation in development with flags like -Ddatadog or -Dotel (and stub the calls for manually instrumented code) to get it back down to a reasonable level. My apps normally take 3-6 seconds to compile when I make a change and I’ve seen the build times go up to a full minute just by loading instrumentation.
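The gating looks something like this (a sketch — the flag name and the no-op stub shape are just what I reach for, not part of either shard):

```crystal
# Only require (and therefore compile) the instrumentation when the
# app is built with `crystal build -Ddatadog ...`. Otherwise provide a
# no-op stand-in so manually instrumented call sites still compile.
{% if flag?(:datadog) %}
  require "datadog"
{% else %}
  module Datadog
    # Stub: skip the span entirely and just run the block.
    def self.trace(name : String)
      yield
    end
  end
{% end %}
```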

What I’m wondering is, is there a better pattern for this that won’t hurt compilation time? I remember @bcardiff once mentioned working on patterns for it to avoid monkeypatching (which is great because what the core team recommends often becomes an ecosystem-wide convention) but I also wanted to see if the community has some ideas for it.

What does the trace method look like?

OpenTelemetry.trace

I don’t think it’s the trace method itself. It’s monomorphic so once it’s compiled it’s done, and manual instrumentation doesn’t seem to exhibit the same effect on compilation times. But the integrations override existing methods like DB::ResultSet#each and call previous_def. It seems to be primarily based on whether or not I load those integrations into my app.

FWIW I’m pretty sure the best practices say that instrumenting specific methods like this isn’t usually needed/recommended. So maybe the solution is to just not do it to every method?

What do you mean “specific methods like this”?

What makes you think this is what I’m doing?

This doesn’t apply to methods that yield.

Remember that a method that has yield in it is essentially inlined in the caller every time it’s called. If the method that has yield has a lot of code in it, there are two implications:

  • the executable will get larger
  • the compiler has to type the method that has yield over and over again

I see that both OpenTelemetry.trace and the redefined DB::ResultSet#each are huge!

They look like this now:

def your_method
  # 100 lines of code
  yield
  # 100 lines of code
end

(of course in your case it’s not 100 lines of code, but it’s still a lot)

Consider refactoring your code to be like this:

def your_method
  helper_method1
  yield
  helper_method2
end

def helper_method1
  # 100 lines of code
end

def helper_method2
  # 100 lines of code
end

That way, yes, the method with yield still has to be typed over and over again, but its body is now just two calls to helper_method1 and helper_method2. Those methods don’t yield, so their bodies are effectively analyzed only once.

I also have a small test to prove this. Take a look at this code:

def bar(x)
  x + x
end

def foo
  {% for i in 0..10_000 %}
    bar({{i}})
  {% end %}
  yield
  {% for i in 0..10_000 %}
    bar({{i}})
  {% end %}
end

{% for i in 0..100 %}
  foo do
    puts "Hello"
  end
{% end %}

That’s essentially your version: a method that has yield and involves a lot of code before and after yield. It takes the compiler 5 seconds to do semantic analysis of that.

Now if we refactor it to be like this:

def bar(x)
  x + x
end

def baz
  {% for i in 0..10_000 %}
    bar({{i}})
  {% end %}
end

def foo
  baz
  yield
  baz
end

{% for i in 0..100 %}
  foo do
    puts "Hello"
  end
{% end %}

It takes 0.14s to analyze it.

Now, ideally, we shouldn’t worry about these things. With modular compilation we wouldn’t need to analyze the entire program each time. But I’m not sure that goal is achievable.

On the other hand… this kind of forces you to have cleaner code. Instead of defining a method with 100 lines of code, you will end up with smaller, more composable methods. I don’t know :sweat_smile:


@jgaskins I’m also curious where most of the time is spent during compilation. If you compile it with -s, what’s the output? Then we’ll see if the semantic phase is taking a long time, or maybe it’s just the codegen phase that has to generate a lot of duplicate code.


Whoa, TIL. I guess now that you mention it, that makes a lot of sense since yield inlines the block. I hadn’t considered that. Would capturing the block and using block.call remove the need to compile it repeatedly? I’m assuming block.call ends up being compiled similarly to a more typical function call with a completely new stack frame and all.
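Something like this is what I have in mind (a sketch — adding a type restriction to the block parameter is what makes Crystal capture it as a Proc instead of inlining it):

```crystal
# With `&block : -> Nil`, the block is captured as a Proc and invoked
# via `block.call` — a regular, non-inlined call — so the method body
# should only need to be typed once, not at every call site.
def foo(&block : -> Nil)
  # ...lots of instrumentation code before...
  block.call
  # ...lots of instrumentation code after...
end

foo { puts "Hello" }
```

One caveat I’m aware of: a captured block is a Proc, so you lose `break` out of the block, and captured variables may force a closure allocation.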

I’ll give this a shot.

I’ll have to play with this a bit in a real dev feedback loop. Trying to re-create it synthetically keeps giving me this, which isn’t a solid benchmark:

Codegen (bc+obj):
 - all previous .o files were reused

Right, that’s correct.

I’ll have to play with this a bit in a real dev feedback loop. Trying to re-create it synthetically keeps giving me this, which isn’t a solid benchmark

I think you’ll need to empty the cache to try these things. Like rm -rf $(crystal env CRYSTAL_CACHE_DIR) (well, better check what that expands to first :sweat_smile:)

I did try removing the cached artifacts, but that was also not representative of my typical workflow. :slightly_smiling_face: I recompile my app on every change, so a lot of the time most .o files can be reused. I use -s in my build process so it isn’t quite so opaque.
