I have an annoying issue that occurs on Linux but works fine on macOS:
spider-gazelle/tasker@f7e462c
Invalid memory access (signal 11) at address 0x0
[0x561a7c4a6b36] *Exception::CallStack::print_backtrace:(Int32 | Nil) +118
[0x561a7c49397c] __crystal_sigfault_handler +316
[0x7efcceea8980] ???
[0x0] ???
That’s using the latest Ubuntu Docker image.
I’ve isolated the issue to this spec:
tasker/tasker_spec.cr
I would check whether setting `CRYSTAL_LOAD_DWARF=1` or `CRYSTAL_LOAD_DWARF=0` leads to a different output. That would confirm or rule out whether the problem is in the DWARF loading itself, which differs between Linux and Darwin.
@bcardiff the DWARF loading setting made no difference.
I used GDB and tracked the segfault to this line:
Stepping through on GDB I see:
Grabbed this just before the segfault:
- breakpoint set for the `break` on line 149
- called `step` and it jumped into the `else` statement, but it should have exited the loop
A screenshot of the state just after the second `step` command.
Although when I overloaded the `reschedule` function with an alternative implementation, I’m still seeing the crash.
What’s also weird is that the crash doesn’t happen consistently either… It crashes 2 out of 3 runs, just to make life difficult.
I solved it by refactoring away from using Fiber scheduling to using channels, as I’d built the library before `select` timeout was available.
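For readers following along, here is a minimal sketch of what a channel-based timer with a `select` timeout can look like. It is not the tasker library's actual code: the `SketchTimer` class and all of its method names are hypothetical, chosen only to illustrate the pattern of waiting on either a cancel message or a timeout instead of sleeping and being woken early.

```crystal
# Hypothetical timer for illustration only; not taken from the tasker library.
class SketchTimer
  getter? fired = false
  getter? cancelled = false

  @action : Proc(Nil)
  @cancel_requested = false

  def initialize(@period : Time::Span, &action : ->)
    @action = action
    # Capacity 1 so a single send never has to wait for a receiver.
    @cancel = Channel(Nil).new(1)
    @done = Channel(Nil).new(1)
  end

  # Starts a fiber that waits for whichever comes first:
  # a cancel message or the timeout.
  def start : self
    spawn do
      select
      when @cancel.receive
        @cancelled = true
      when timeout(@period)
        @fired = true
        @action.call
      end
      @done.send(nil)
    end
    self
  end

  # Only the first call sends, so the buffered channel cannot fill up.
  def cancel : Nil
    return if @cancel_requested
    @cancel_requested = true
    @cancel.send(nil)
  end

  # Blocks the calling fiber until the timer fiber has finished.
  def wait : Nil
    @done.receive
  end
end

timer = SketchTimer.new(5.seconds) { puts "fired" }.start
timer.cancel
timer.wait
puts "cancelled: #{timer.cancelled?}"
```

The point of the design is that nothing resumes a sleeping fiber by hand: the timer fiber is only ever woken by the scheduler through the channel or the timeout.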
Good! You mean this change, right? Getting rid of those `sleep` calls is an improvement.
Maybe the segfault was coming from rescheduling a fiber in an invalid state, so it’s definitely better to use Channel for synchronization.
On that note, maybe you want the cancel channel to have a capacity of 1 (instead of 0) so the `#cancel` method will never block. I think right now it could in some race condition. The `cancel.send` will block until a receiver gets the message, but that is not what you want here.
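A small sketch of the difference, using standalone channels rather than the library's own (the variable names here are made up for the example). With capacity 0, `send` suspends the sending fiber until some other fiber receives; with capacity 1, the first `send` parks the value in the buffer and returns immediately.

```crystal
unbuffered = Channel(Nil).new  # capacity 0: send waits for a receiver
buffered = Channel(Nil).new(1) # capacity 1: one value can sit in the buffer

# No fiber is receiving, yet this returns straight away.
buffered.send(nil)
buffered_returned = true

# The same send on the unbuffered channel cannot complete because nobody
# is receiving, so the timeout branch is the one that runs.
send_blocked = false
select
when unbuffered.send(nil)
  send_blocked = false
when timeout(10.milliseconds)
  send_blocked = true
end

puts "buffered send returned: #{buffered_returned}"
puts "unbuffered send blocked: #{send_blocked}"
```

Note that a capacity of 1 only guarantees that the first unreceived `send` does not block; if cancel could be called repeatedly, guarding it with a flag is one way to keep it non-blocking.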
Is this multi-threaded? Any simplified code to repro it?
Not multi-threaded, and no simplified code…
It was pretty hacky fiber manipulation: waking fibers from sleep early, which might have left some dangling pointers somewhere?
This was the fiber timer
Effectively:
- sleep for the timer period and then perform an action if not cancelled
- cancelling the timer involved flagging that it was cancelled and then waking it up early to clean up memory
I’m not sure and I could be totally off, but would a mutex lock around `@cancelled` help?
It shouldn’t be required, as everything runs on a single thread, so it’s most likely unrelated.
The issue only occurred on Linux and not on every run, so it’s probably going to be a tough one to track down.
Hard to diagnose without a simplified example. Maybe Valgrind would help? Good luck!
All good, I refactored around it.