I have an annoying issue that occurs on Linux, while the same code works fine on macOS
spider-gazelle/tasker@f7e462c
Invalid memory access (signal 11) at address 0x0
[0x561a7c4a6b36] *Exception::CallStack::print_backtrace:(Int32 | Nil) +118
[0x561a7c49397c] __crystal_sigfault_handler +316
[0x7efcceea8980] ???
[0x0] ???
That’s using the latest Ubuntu Docker image.
I’ve isolated the issue to this spec:
tasker/tasker_spec.cr
I would check whether setting CRYSTAL_LOAD_DWARF=1 or CRYSTAL_LOAD_DWARF=0 leads to a different output. That would confirm or rule out the DWARF loading itself as the problem, since it works differently on Linux and Darwin.
@bcardiff the DWARF loading setting made no difference
I used GDB and tracked the segfault to this line:
Stepping through on GDB I see:
Grabbed this just before the seg fault
- breakpoint set for the break on line 149
- called step and it jumped into the else statement, but it should have exited the loop
A screenshot of the state just after the second step statement
However, when I overrode the reschedule function with an alternative implementation, I still saw the crash.
What’s also weird is that the crash doesn’t happen consistently either. It crashes 2 out of 3 runs, just to make life difficult.
I solved it by refactoring away from Fiber scheduling to channels, as I’d built the library before select timeout was available.
Good! You mean this change, right? Getting rid of those sleep calls is an improvement.
Maybe the segfault was coming from rescheduling a fiber in an invalid state, so it’s definitely better to use Channel for synchronization.
On that note, maybe you want the cancel channel to have a capacity of 1 (instead of 0) so the #cancel method will never block. I think right now it could in some race condition. The cancel.send will block until a receiver gets the message, but that is not what you want here.
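To make the suggestion concrete, here is a minimal sketch of a cancellable timer built on select with a timeout and a cancel channel of capacity 1. It is not the tasker implementation: `CancellableTimer`, its methods, and the durations are hypothetical names and values chosen for illustration, and it assumes a Crystal version where `timeout` is available inside `select`.

```crystal
# Hypothetical timer, for illustration only (not taken from tasker).
class CancellableTimer
  getter? cancelled = false

  def initialize(@period : Time::Span, &@action : ->)
    # Capacity 1: #cancel can deposit its signal in the buffer and return
    # immediately, even if the timer fiber is no longer receiving.
    @cancel = Channel(Nil).new(1)
    spawn { run }
  end

  def cancel : Nil
    return if @cancelled
    @cancelled = true
    @cancel.send(nil)
  end

  private def run
    select
    when @cancel.receive
      # Cancelled before the period elapsed: nothing to do.
    when timeout(@period)
      @action.call unless @cancelled
    end
  end
end

# Usage: cancel before the period elapses, then check that nothing ran.
fired = Channel(Symbol).new(1)
timer = CancellableTimer.new(50.milliseconds) { fired.send(:ran) }
timer.cancel

select
when fired.receive
  puts "action ran"
when timeout(200.milliseconds)
  puts "cancelled before firing"
end
```

With a capacity of 0, the `@cancel.send(nil)` call would wait for a receiver, so calling cancel after the timer had already fired would leave the caller blocked. The buffered channel avoids that, and nothing needs to wake a sleeping fiber by hand.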
Is this multi-threaded? Any simplified code to repro it?
Not multi-threaded, and no simplified code…
It was pretty hacky fiber manipulation: waking fibers from sleep early, which might have left some dangling pointers somewhere?
This was the fiber timer
Effectively:
- sleep for the timer period and then perform an action if not cancelled
- cancelling the timer involved flagging that it was cancelled and then waking it up early to clean up memory
I’m not sure and I could be totally off, but would a mutex lock around @cancelled help?
That shouldn’t be required, as everything runs on a single thread, so it is most likely unrelated.
The issue only occurred on Linux and not on every run, so it is probably going to be a tough one to track down.
Hard to diagnose without a simplified example. Maybe Valgrind would help? Good luck!
All good, I refactored around it.