How to diagnose occasional Segfault? In Time.monotonic within Timecop.travel

In our CI system, we’ve had a handful of builds die with a segfault (SIGSEGV) while running specs, in particular a section of our specs that use Timecop.travel. So far I can not reproduce this locally, and can’t reproduce it in CI either – it’s just caused a few CI builds to fail and then they succeed when the build is retried. It’s happened in both 1.10.1 and 1.9.2. Any ideas on how could I go about debugging this?

Here’s the relevant output from CI logs:

Invalid memory access (signal 11) at address 0x4
[0x562c5a2ac9c6] *Exception::CallStack::print_backtrace:Nil +118 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a2349d6] ~procProc(Int32, Pointer(LibC::SiginfoT), Pointer(Void), Nil) +310 in /app/.cache/crystal/crystal-run-spec.tmp
[0x7f22cafc1fd0] ?? +139787411136464 in /lib/x86_64-linux-gnu/libc.so.6
[0x562c5a41f951] *Timecop::TimeStackItem#monotonic:Time::Span +17 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a3e16a3] *Time::monotonic:Time::Span +67 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a7221b1] *IdleGC::collect:Nil +33 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a289be5] ~procProc(Nil) +389 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a36bf52] *Fiber#run:(IO::FileDescriptor | Nil) +114 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a2344f6] ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) +6 in /app/.cache/crystal/crystal-run-spec.tmp
[0x0] ???
Command exited with non-zero status 11
Command being timed: "crystal spec --verbose --stats -Dpreview_mt"

I’ve looked at:

But don’t see any obvious issues. Per the libc docs on clock_gettime, “This function cannot fail.”

I’ll note that the IdleGC.collect call in the stacktrace means this may be happening in a different Fiber than the Timecop.travel call. But still, a segfault seems really surprising.

Apologies, I realize it’s not too useful since I can’t reproduce this, but wondering what could potentially cause a segfault. Thanks for your help.

1 Like

If it’s relevant, the IdleGC.collect method looks like this:

def self.collect : Nil
  mu.synchronize do
    start_time = Time.monotonic
    GC.collect
    end_time = Time.monotonic

    @@last_checked_at = start_time
    @@last_collected_at = start_time
    @@last_collected_duration = end_time - start_time
  end
end

And Crystal::System::Time.monotonic basically boils down to: LibC.clock_gettime(LibC::CLOCK_MONOTONIC, out tp)

The only way I can think of a segfault happening here is if somehow the address for out tp on the heap is considered invalid. (Could that be related to adjacent garbage collection happening somehow?)

But have not experienced this in production, can not reproduce, and have only seen in CI, and only in specs where Timecop.travel was involved.

out tp in Crystal::System::Time.monotonic is a local variable on the stack, not heap allocated. So there shouldn’t be any interaction with GC.
In fact, Time.monotonic - at least the stdlib implementation - is completely allocation-free.

2 Likes

@straight-shoota OK, thanks, that’s very interesting to know. I had (incorrectly) assumed that there’d be a pointer on the stack, but it would be pointing to the struct timespec itself on the heap.

This makes the segfault even more strange…

It sounds like probably caused by some invalid code (e.g. type conflict etc)?

Timecop.travel just the reason for ignore/run some code snippets?

Anyway, this is a idea.