Intrinsics.pause is a no-op on non-Intel CPUs

I think I found a bug, but it touches on some areas I’m not super familiar with (namely, anything to do with LLVM), so I want a quick sanity check before I open an issue.

I’m implementing a Redis-like database in Crystal and decided to try it with -D preview_mt to see how well it worked, since a multithreaded Redis workalike would be pretty great. I got some slightly incorrect results and mutex deadlocks once I went beyond 500 or so fibers. Deadlocks were far more prevalent with --release; I don’t think I saw a single completed run with it enabled.

I tweaked my locking implementation for a while trying to figure out what was causing it (I thought it was on my end) and noticed some weird behavior: if I added a Fiber.yield before the Mutex#synchronize call, sometimes everything would work just fine (I tried this because logging would sometimes clear it up) and other times it would fail with this error message. That’s when I dove into the Mutex code and noticed that Intrinsics.pause is a no-op on non-Intel architectures:
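
(This is roughly what the stdlib does, paraphrased from memory rather than copied verbatim; note that the non-x86 branch compiles to nothing.)

lib LibIntrinsics
  {% if flag?(:i386) || flag?(:x86_64) %}
    fun pause = "llvm.x86.sse2.pause"
  {% end %}
end

module Intrinsics
  def self.pause
    {% if flag?(:i386) || flag?(:x86_64) %}
      LibIntrinsics.pause
    {% end %}
  end
end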

I tried it on my Intel iMac and it worked just fine, even well beyond 500 fibers, so it seems Intrinsics.pause is pretty critical to correctness in a multithreaded app. I checked LLVM, and apparently pause is a builtin for x86, but that call doesn’t exist for AArch64 (the Apple M1). I can’t tell whether there’s anything comparable.

I found a YIELD instruction in the ARM code but it doesn’t seem to do the trick. This was the monkeypatch I added to my app to get it to use that instruction:

# Monkeypatches for ARM
lib LibIntrinsics
  {% if flag?(:arm) || flag?(:aarch64) %}
    # https://github.com/llvm/llvm-project/blob/e356027016c6365b3d8924f54c33e2c63d931492/llvm/test/CodeGen/AArch64/hints.ll#L14-L21
    enum ARMHint : Int32
      YIELD = 1
    end

    {% if flag?(:arm) %}
      fun arm_hint = "llvm.arm.hint"(hint : ARMHint)
    {% elsif flag?(:aarch64) %}
      fun arm_hint = "llvm.aarch64.hint"(hint : ARMHint)
    {% end %}
  {% end %}
end

module Intrinsics
  def self.pause
    {% if flag?(:i386) || flag?(:x86_64) %}
      LibIntrinsics.pause
    {% elsif flag?(:arm) || flag?(:aarch64) %}
      LibIntrinsics.arm_hint(LibIntrinsics::ARMHint::YIELD)
    {% end %}
  end
end

With this in place I’m still seeing deadlocks and incorrect results. Anyone more familiar with LLVM and/or ARM know what we could do here?


“so it seems Intrinsics.pause is pretty critical to correctness in a multithreaded app”

That doesn’t make sense. What pause does is hint to the CPU that the current thread is busy-waiting, so it can relax for a moment instead of hammering the pipeline; the waiting loop keeps running either way, and the operating system will eventually preempt the thread regardless, so it should only be a performance optimization. It seems likely that you have found a race condition (either in your own code or in the Crystal core) and that the race condition doesn’t show itself when the scheduling happens to fall differently. But that doesn’t mean pause is the culprit, it means something else has faulty logic.
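
To make that concrete, here is a minimal sketch of the kind of spin-wait loop pause is meant for (this is not Crystal’s actual Mutex internals, just the general shape of the pattern). The loop is correct with or without the hint; the hint only makes the spinning cheaper:

class SpinLock
  @held = Atomic(Int32).new(0)

  def lock
    # Atomic#swap returns the previous value, so a 1 means someone else still holds it.
    while @held.swap(1) == 1
      # Purely a hint to the CPU while we busy-wait; deleting this line (or it
      # being a no-op, as on ARM today) cannot change whether the loop exits.
      Intrinsics.pause
    end
  end

  def unlock
    @held.set(0)
  end
end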

After hours of experimentation yesterday, reading LLVM source and filling my browser with more tabs than I’d like to admit, that’s about where I’m at, too.

My locking code looks like this:

  class LockPool
    # Should we use a pool to reduce GC pressure?
    # @pool = Set(Lock).new

    @locks = Hash(String, Lock).new

    # We need our own lock so that multiple lock requests for the same key don't
    # create multiple locks or delete one when another one began waiting for it.
    @lock = Mutex.new

    def initialize
    end

    # Acquire a lock for the given key.
    def lock(key : String)
      fiber = Fiber.current

      lock = lock do
        @locks.fetch(key) do
          @locks[key] = Lock.new
        end
      end

      begin
        lock.waiting!
        lock.synchronize { yield }
      ensure
        lock do
          if lock.done! <= 0
            @locks.delete key
          end
        end
      end
    end

    private def lock
      @lock.synchronize { yield }
    end

    # A lock that exposes how many fibers are waiting on it.
    private class Lock
      @waiting = Atomic(Int32).new(0)
      @mutex = Mutex.new

      def waiting!
        @waiting.add(1) + 1
      end

      def synchronize
        @mutex.synchronize { yield }
      end

      def done!
        @waiting.sub(1) - 1
      end
    end
  end

Locks have key-level granularity: when you need to lock a key, the pool takes an administrative lock to see whether a Lock is already allocated for that key; if not, it allocates and assigns one, then releases the administrative lock. When no more fibers are trying to acquire a key’s Lock, the pool deletes it (or would return it to a pool, but I haven’t implemented that part yet).

The idea is to wrap this around commands like INCR, so that if the key doesn’t exist the command can create it safely, which is basically the same check-then-create dance the LockPool does for its own keys. Simplified code for performing the increment:

lock_pool.lock key do
  integer = data.fetch(key) { Integer.new(0_i64) }
  integer.increment
end

Putting on my SRE hat here: use care when throwing around words like “the culprit” or “faulty logic”. Failure in software rarely, if ever, has a single root cause; it is a combination of factors, and the “root cause” wouldn’t result in failure if certain other factors weren’t present. With this in mind, I want to be clear that I didn’t mean to imply that the lack of PAUSE was the root cause, only to suggest that it seems to be a necessary component of Crystal’s overall concurrency implementation when multithreading is in play.


I think I see a problem. Suppose you have three threads running simultaneously, currently at positions A, B, and C respectively:

    def lock(key : String)
      fiber = Fiber.current
C
      lock = lock do
        @locks.fetch(key) do
          @locks[key] = Lock.new
        end
      end
B
      begin
        lock.waiting!
        lock.synchronize { yield }
      ensure
        lock do
A
          if lock.done! <= 0
            @locks.delete key
          end
        end
      end
    end

The Lock’s current waiting count is 1, and all three threads are using the same key.

B has just exited the administrative lock, and A has picked it up. A happens to execute first: its done! drops the waiting count to 0, so it deletes the key from @locks. B then proceeds with the original Lock, while C, finding nothing under that key, creates a totally new one. C picks up the newly created Lock and enters the protected section at the same time as B.
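
If you want to see it happen rather than reason about interleavings, a driver along these lines (hypothetical code, assuming the LockPool above is reachable and the program is built with -D preview_mt) should eventually come up short after a few runs: many fibers contend on a single key, and any lost update means two of them were inside the protected section at the same time.

pool = LockPool.new
count = 0
done = Channel(Nil).new

1_000.times do
  spawn do
    # The lock is supposed to make this read-modify-write exclusive.
    pool.lock("counter") { count += 1 }
    done.send(nil)
  end
end

1_000.times { done.receive }
puts count # anything below 1000 means the lock was not actually exclusive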


Ahh, nice catch! For the same reason I put lock.done! inside the administrative lock, it looks like I need to do the same with lock.waiting!. I’ll give this a shot today. Thanks!
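
Something like this is what I have in mind; a rough sketch of moving waiting! under the administrative lock rather than the final code (it also renames the local so it no longer shadows the private lock method):

def lock(key : String)
  # Grab (or create) the per-key Lock and register as a waiter while still
  # holding the administrative lock, so a concurrent done! can no longer
  # delete or replace it between the fetch and the wait.
  entry = lock do
    per_key = @locks.fetch(key) { @locks[key] = Lock.new }
    per_key.waiting!
    per_key
  end

  begin
    entry.synchronize { yield }
  ensure
    lock do
      @locks.delete(key) if entry.done! <= 0
    end
  end
end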

Wild that this never showed up on my Intel Mac, even with --release.

Nice that it has been fixed. Either way, pause could still be a relevant performance optimization on ARM. If possible, consider contributing your YIELD implementation to the stdlib :)


Shout out to @lbguilherme for getting this PR in! :100:
