How to diagnose occasional Segfault? In Time.monotonic within Timecop.travel

compumike · October 18, 2023, 8:45pm

In our CI system, we’ve had a handful of builds die with a segfault (SIGSEGV) while running specs, in particular a section of our specs that use Timecop.travel. So far I can not reproduce this locally, and can’t reproduce it in CI either – it’s just caused a few CI builds to fail and then they succeed when the build is retried. It’s happened in both 1.10.1 and 1.9.2. Any ideas on how could I go about debugging this?

Here’s the relevant output from CI logs:

Invalid memory access (signal 11) at address 0x4
[0x562c5a2ac9c6] *Exception::CallStack::print_backtrace:Nil +118 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a2349d6] ~procProc(Int32, Pointer(LibC::SiginfoT), Pointer(Void), Nil) +310 in /app/.cache/crystal/crystal-run-spec.tmp
[0x7f22cafc1fd0] ?? +139787411136464 in /lib/x86_64-linux-gnu/libc.so.6
[0x562c5a41f951] *Timecop::TimeStackItem#monotonic:Time::Span +17 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a3e16a3] *Time::monotonic:Time::Span +67 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a7221b1] *IdleGC::collect:Nil +33 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a289be5] ~procProc(Nil) +389 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a36bf52] *Fiber#run:(IO::FileDescriptor | Nil) +114 in /app/.cache/crystal/crystal-run-spec.tmp
[0x562c5a2344f6] ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) +6 in /app/.cache/crystal/crystal-run-spec.tmp
[0x0] ???
Command exited with non-zero status 11
Command being timed: "crystal spec --verbose --stats -Dpreview_mt"

I’ve looked at:

Time.monotonic
Crystal::System::Time.monotonic – which is where the actual LibC.clock_gettime(LIBC::CLOCK_MONOTONIC, ...) syscall happens
Timecop::TimeStackItem#monotonic

But don’t see any obvious issues. Per the libc docs on clock_gettime, “This function cannot fail.”

I’ll note that the IdleGC.collect call in the stacktrace means this may be happening in a different Fiber than the Timecop.travel call. But still, a segfault seems really surprising.

Apologies, I realize it’s not too useful since I can’t reproduce this, but wondering what could potentially cause a segfault. Thanks for your help.

compumike · October 18, 2023, 9:01pm

If it’s relevant, the IdleGC.collect method looks like this:

def self.collect : Nil
  mu.synchronize do
    start_time = Time.monotonic
    GC.collect
    end_time = Time.monotonic

    @@last_checked_at = start_time
    @@last_collected_at = start_time
    @@last_collected_duration = end_time - start_time
  end
end

And Crystal::System::Time.monotonic basically boils down to: LibC.clock_gettime(LibC::CLOCK_MONOTONIC, out tp)

The only way I can think of a segfault happening here is if somehow the address for out tp on the heap is considered invalid. (Could that be related to adjacent garbage collection happening somehow?)

But have not experienced this in production, can not reproduce, and have only seen in CI, and only in specs where Timecop.travel was involved.

straight-shoota · October 19, 2023, 9:51am

out tp in Crystal::System::Time.monotonic is a local variable on the stack, not heap allocated. So there shouldn’t be any interaction with GC.
In fact, Time.monotonic - at least the stdlib implementation - is completely allocation-free.

compumike · October 19, 2023, 4:31pm

@straight-shoota OK, thanks, that’s very interesting to know. I had (incorrectly) assumed that there’d be a pointer on the stack, but it would be pointing to the struct timespec itself on the heap.

This makes the segfault even more strange…

zw963 · October 20, 2023, 11:32am

It sounds like probably caused by some invalid code (e.g. type conflict etc)?

Timecop.travel just the reason for ignore/run some code snippets?

Anyway, this is a idea.

Topic		Replies	Views
Debugging segfaults further Help & Support	2	330	May 25, 2019
Segfault on Linux Help & Support	12	624	June 11, 2021
Invalid memory access in Crystal compiler [pre-solved] Help & Support	10	1046	June 17, 2020
Difficult to trace segfault Help & Support	5	483	February 25, 2020
Compiler release options causing segfaults Help & Support	12	478	March 13, 2020

How to diagnose occasional Segfault? In Time.monotonic within Timecop.travel

Related topics