The Crystal Programming Language Forum

Segfault on Linux

I have an annoying issue: a segfault that occurs on Linux, while the same code works fine on macOS.
spider-gazelle/tasker@f7e462c

Invalid memory access (signal 11) at address 0x0
[0x561a7c4a6b36] *Exception::CallStack::print_backtrace:(Int32 | Nil) +118
[0x561a7c49397c] __crystal_sigfault_handler +316
[0x7efcceea8980] ???
[0x0] ???

That’s using the latest Ubuntu Docker image.
I’ve isolated the issue to this spec:
tasker/tasker_spec.cr

I would check whether setting CRYSTAL_LOAD_DWARF=1 or CRYSTAL_LOAD_DWARF=0 leads to a different output. That would confirm or rule out a problem in the DWARF loading itself, which differs between Linux and Darwin.

@bcardiff the DWARF loading setting made no difference.

I used GDB and tracked the segfault to this line:

Stepping through in GDB I see:

I grabbed this just before the segfault:

  1. I set a breakpoint on the break at line 149
  2. I called step and it jumped into the else branch, but it should have exited the loop

A screenshot of the state just after the second step:

However, when I overrode the reschedule function with an alternative implementation, I still saw the crash.

What’s also weird is that the crash doesn’t happen consistently either… It crashes in roughly 2 out of 3 runs, just to make life difficult.

I solved it by refactoring away from Fiber scheduling to channels. I had built the library before select timeout was available.
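For anyone landing here later, a minimal sketch of what a channel-based timer with a select timeout can look like. This is not the tasker code: `start_timer` and the variable names are made up for illustration, and it assumes a Crystal version where `timeout` is available inside `select`.

```crystal
# Hypothetical sketch, not taken from the tasker library.
# Starts a one-shot timer and returns the channel used to cancel it.
def start_timer(period : Time::Span, &action : ->) : Channel(Nil)
  # Capacity 1 so that a single cancel never blocks the caller.
  cancel = Channel(Nil).new(1)

  spawn do
    select
    when cancel.receive
      # Cancelled before the period elapsed: skip the action.
    when timeout(period)
      action.call
    end
  end

  cancel
end

fired = [] of String
first = start_timer(20.milliseconds) { fired << "first" }
second = start_timer(20.milliseconds) { fired << "second" }

second.send(nil) # cancel the second timer before it fires

sleep 100.milliseconds
p fired # => ["first"]
```

The fiber only ever waits inside `select`, so cancelling is an ordinary channel send and nothing has to wake a sleeping fiber by hand.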

Good! You mean this change, right? Getting rid of those sleeps is :rocket:

Maybe the segfault was coming from rescheduling a fiber in an invalid state, so it’s definitely better to use Channel for synchronization.

On that note, you may want to give the cancel channel a capacity of 1 (instead of 0) so that the #cancel method never blocks. I think right now it could block in some race conditions: cancel.send blocks until a receiver takes the message, which is not what you want here.
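To make the difference concrete, here is a small standalone sketch (not the library's code; `try_send` is a made-up helper) that probes a send on each kind of channel while nobody is receiving:

```crystal
# Hypothetical illustration of capacity 0 versus capacity 1.
unbuffered = Channel(Nil).new  # capacity 0: send waits for a receiver
buffered = Channel(Nil).new(1) # capacity 1: the first send is stored

# Attempts a send and gives up after 50 ms if it has not completed.
def try_send(channel : Channel(Nil)) : Bool
  sent = false
  select
  when channel.send(nil)
    sent = true
  when timeout(50.milliseconds)
    sent = false
  end
  sent
end

puts try_send(unbuffered) # prints false: no receiver, so the send cannot complete
puts try_send(buffered)   # prints true: the value fits in the buffer
```

With capacity 0, a cancel issued after the timer fiber has already finished would have no receiver at all, so a plain `send` would wait forever.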

Is this multi-threaded? Any simplified code to repro it?

Not multi-threaded, and no simplified code…
It was pretty hacky fiber manipulation: waking fibers from sleep early, which might have left some dangling pointers somewhere?

This was the fiber timer

Effectively:

  • sleep for the timer period and then perform an action if not cancelled
  • cancelling the timer involved flagging that it was cancelled and then waking it up early to clean up memory
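Roughly, the flag half of that looks like the sketch below. `FlagTimer` is a hypothetical name and this is not the original source. The early wake-up is deliberately left out, since it relied on manual fiber rescheduling that isn't shown in this thread.

```crystal
# Hypothetical sketch of the flag-based approach, without the early wake-up.
class FlagTimer
  @action : Proc(Nil)
  getter? cancelled = false

  def initialize(@period : Time::Span, &action : ->)
    @action = action
  end

  # Sleep for the full period, then run the action unless cancelled.
  def start : Nil
    spawn do
      sleep @period
      @action.call unless @cancelled
    end
  end

  # Only sets the flag; the fiber still sleeps until the period ends.
  def cancel : Nil
    @cancelled = true
  end
end

log = [] of String
a = FlagTimer.new(20.milliseconds) { log << "a" }
b = FlagTimer.new(20.milliseconds) { log << "b" }
a.start
b.start
b.cancel

sleep 100.milliseconds
p log # => ["a"]
```

Without the early wake-up, a cancelled timer keeps its fiber alive until the period runs out, which is why the original code woke it early.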

I’m not sure, and I could be totally off, but would a mutex lock around @cancelled help?

That shouldn’t be required since everything runs on one thread, so it’s most likely unrelated.
The issue only occurred on Linux, and not on every run, so it’s probably going to be a tough one to track down.

Hard to diagnose without a simplified example. Maybe valgrind would help? Good luck!

All good, I refactored around it.