Segfault on Linux

stakach · May 24, 2021, 2:39pm

I have an annoying issue that occurs on Linux but works fine on macOS:
spider-gazelle/tasker@f7e462c

Invalid memory access (signal 11) at address 0x0
[0x561a7c4a6b36] *Exception::CallStack::print_backtrace:(Int32 | Nil) +118
[0x561a7c49397c] __crystal_sigfault_handler +316
[0x7efcceea8980] ???
[0x0] ???

That’s using the latest Ubuntu Docker image.
I’ve isolated the issue to this spec:
tasker/tasker_spec.cr

bcardiff · May 24, 2021, 7:26pm

I would check whether setting CRYSTAL_LOAD_DWARF=1 or CRYSTAL_LOAD_DWARF=0 leads to a different output. That would confirm or rule out the DWARF loading itself as the problem, since it works differently on Linux and Darwin.

stakach · May 25, 2021, 2:24am

@bcardiff the DWARF loading made no difference.

I used GDB and tracked the segfault to this line:

Stepping through on GDB I see:

stakach · May 25, 2021, 3:00am

Grabbed this just before the segfault:

A breakpoint was set on the break statement on line 149.
I called step and it jumped into the else branch, but it should have exited the loop.

A screenshot of the state just after the second step command:

stakach · May 25, 2021, 3:16am

However, when I overrode the reschedule function with an alternative implementation, I still saw the crash.

What’s also weird is that the crash doesn’t happen consistently either… It crashes 2 out of 3 runs, just to make life difficult.

stakach · May 25, 2021, 8:38am

I solved it by refactoring away from Fiber scheduling to channels, as I’d built the library before select timeout was available.

bcardiff · May 25, 2021, 9:49pm

Good! You mean this change, right? Getting rid of those sleep calls is a good thing.

Maybe the segfault was coming from rescheduling a fiber in an invalid state, so it’s definitely better to use Channel for synchronization.

On that note, maybe you want the cancel channel to have a capacity of 1 (instead of 0) so the #cancel method will never block. I think right now it could block in some race condition: cancel.send blocks until a receiver gets the message, and that is not what you want here.
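
To illustrate the capacity point, here is a minimal sketch. It is not taken from tasker; the helper name send_completes? is hypothetical. It only assumes the standard Channel API and select with timeout.

```crystal
# Hypothetical helper (not part of tasker): reports whether a send on
# `channel` completes within `wait`, using select with a timeout branch.
def send_completes?(channel : Channel(Nil), wait : Time::Span) : Bool
  select
  when channel.send(nil)
    return true
  when timeout(wait)
    return false
  end
end

unbuffered = Channel(Nil).new  # capacity 0: send needs a waiting receiver
buffered = Channel(Nil).new(1) # capacity 1: one send fits in the buffer

# No fiber is receiving on either channel here, so only the buffered
# channel is expected to accept the message without blocking.
puts send_completes?(unbuffered, 50.milliseconds)
puts send_completes?(buffered, 50.milliseconds)
```

With capacity 1, a single cancel message can be queued even if the receiving fiber has not reached its receive yet, or has already finished.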

rogerdpack · June 7, 2021, 9:48pm

Is this multi-threaded? Any simplified code to repro it?

stakach · June 8, 2021, 9:55pm

Not multi-threaded, and no simplified code…
It was pretty hacky fiber manipulation: waking fibers from sleep early, which might have left some dangling pointers somewhere?

This was the fiber timer:

spider-gazelle/tasker/blob/e7f4af0dcfe6488af671aabdfcf7cedce70e95a1/src/tasker/timer.cr#L16

          sleep sleep_for
          block.call unless @cancelled
        end
      end

      def start_timer
        Fiber.current.enqueue
        @fiber.resume
      end

      def cancel
        @cancelled = true
        if !@fiber.dead? && @fiber.resumable?
          Fiber.current.enqueue
          @fiber.resume
        end
      end
    end

Effectively:

Sleep for the timer period, then perform an action if not cancelled.
Cancelling the timer involved flagging it as cancelled and then waking it up early to clean up memory.
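
For contrast, a channel-based version of the same idea might look roughly like the sketch below. This is not the actual tasker refactor; the class name SketchTimer and its methods are hypothetical, and it assumes select with timeout plus a cancel channel of capacity 1.

```crystal
# Hypothetical sketch, not the tasker implementation.
class SketchTimer
  @block : Proc(Nil)

  def initialize(@period : Time::Span, &block : ->)
    @block = block
    # Capacity 1 so #cancel does not block, even if the timer fiber
    # has already finished and nobody will ever receive.
    @cancel = Channel(Nil).new(1)
    @cancelled = false
  end

  def start : Nil
    spawn do
      select
      when @cancel.receive
        # Cancelled before the period elapsed: skip the action.
      when timeout(@period)
        @block.call
      end
    end
  end

  def cancel : Nil
    return if @cancelled # a second cancel must not fill the buffer again
    @cancelled = true
    @cancel.send(nil)
  end
end

timer = SketchTimer.new(20.milliseconds) { puts "timer fired" }
timer.start
timer.cancel          # wakes the waiting fiber through the channel
sleep 50.milliseconds # give the fiber time to run and exit
```

The fiber is woken by the scheduler through the channel, so nothing calls Fiber#resume or Fiber#enqueue by hand.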

drhuffman12 · June 9, 2021, 4:46am

I’m not sure and I could be totally off, but would a mutex lock around @cancelled help?

stakach · June 9, 2021, 8:13am

That shouldn’t be required, as it’s running on a single thread, so it’s most likely unrelated.
The issue only occurred on Linux and not on every run, so it’s probably going to be a tough one to track down.

rogerdpack · June 11, 2021, 2:09am

Hard to diagnose without a simplified example. Maybe valgrind would help? Good luck!

stakach · June 11, 2021, 2:27am

All good, I refactored around it.
