I think I found a bug, but it deals in some areas I’m not super familiar with (namely, anything to do with LLVM) so I want to get a quick check before I open an issue.
I’m implementing a Redis-like database in Crystal and decided to try it out with -D preview_mt just to check out how well it worked because a multithreaded Redis-workalike would be pretty great. There were some slightly incorrect results and mutex deadlocks once I went beyond 500 or so fibers. Deadlocks were far more prevalent with --release — I don’t think I saw a single completed run with it enabled.
I tweaked my locking implementation for a while trying to figure out what was causing it (I assumed it was on my end) and noticed some weird behavior: if I added a Fiber.yield before the Mutex#synchronize call, sometimes everything worked fine (I tried this because I noticed logging would sometimes clear it up) and other times it would fail with the same error message. That’s when I dove into the Mutex code and noticed that Intrinsics.pause is a no-op on non-Intel architectures.
I tried it on my Intel iMac and it worked just fine, even well beyond 500 fibers, so it seems Intrinsics.pause is pretty critical to correctness in a multithreaded app. I checked LLVM and apparently pause is a builtin for x86, but that call doesn’t exist for AArch64 (the Apple M1). I can’t tell if there’s anything comparable.
I found a YIELD instruction in the ARM code, but it doesn’t seem to do the trick. This is the monkeypatch I added to my app to get it to use that instruction:
```crystal
# Monkeypatches for ARM
lib LibIntrinsics
  {% if flag?(:arm) || flag?(:aarch64) %}
    # https://github.com/llvm/llvm-project/blob/e356027016c6365b3d8924f54c33e2c63d931492/llvm/test/CodeGen/AArch64/hints.ll#L14-L21
    enum ARMHint : Int32
      YIELD = 1
    end

    {% if flag?(:arm) %}
      fun arm_hint = "llvm.arm.hint"(hint : ARMHint)
    {% elsif flag?(:aarch64) %}
      fun arm_hint = "llvm.aarch64.hint"(hint : ARMHint)
    {% end %}
  {% end %}
end

module Intrinsics
  def self.pause
    {% if flag?(:i386) || flag?(:x86_64) %}
      LibIntrinsics.pause
    {% elsif flag?(:arm) || flag?(:aarch64) %}
      LibIntrinsics.arm_hint(LibIntrinsics::ARMHint::YIELD)
    {% end %}
  end
end
```
With this in place I’m still seeing deadlocks and incorrect results. Anyone more familiar with LLVM and/or ARM know what we could do here?
> so it seems Intrinsics.pause is pretty critical to correctness in a multithreaded app
That doesn’t make sense. What pause does is tell the scheduler “hey, it’s time to let something else run for a while.” But since operating systems nowadays will eventually preempt the thread anyway, it should only be a performance optimization. It seems likely that you have found a race condition (either in your own code or in Crystal core) that won’t show itself if the rescheduling happens at the right moment. But that doesn’t mean pause is the culprit; it means something else has faulty logic.
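To illustrate the point about races that only appear (or disappear) depending on when a reschedule lands, here is a minimal check-then-act race, sketched in Ruby since it runs the same way as the Crystal equivalent. The `withdraw` lambda and the `Thread.pass` placement are purely illustrative, not code from the app being discussed:

```ruby
# A check-then-act race: whether it corrupts state depends entirely on
# whether a reschedule lands between the check and the act. Timing tweaks
# like pause or an extra yield can mask or expose the bug without fixing it.
balance = 100

# Unsafe: the check and the withdrawal are two separate, unsynchronized steps.
withdraw = ->(amount) {
  if balance >= amount   # check
    Thread.pass          # a reschedule here lets another thread pass the check too
    balance -= amount    # act
  end
}

threads = 2.times.map { Thread.new { withdraw.call(100) } }
threads.each(&:join)

puts balance # 0 if the threads serialized, -100 if both passed the check
```

Wrapping the check and the act in one `Mutex#synchronize` block removes the window entirely, regardless of scheduling.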
After hours of experimentation, reading LLVM source, and inundating my browser with more tabs than I’d like to admit yesterday, that’s about where I’m at, too.
My locking code looks like this:
```crystal
class LockPool
  # Should we use a pool to reduce GC pressure?
  # @pool = Set(Lock).new
  @locks = Hash(String, Lock).new

  # We need our own lock so that multiple lock requests for the same key don't
  # create multiple locks or delete one when another one began waiting for it.
  @lock = Mutex.new

  def initialize
  end

  # Acquire a lock for the given key.
  def lock(key : String)
    fiber = Fiber.current
    lock = lock do
      @locks.fetch(key) do
        @locks[key] = Lock.new
      end
    end

    begin
      lock.waiting!
      lock.synchronize { yield }
    ensure
      lock do
        if lock.done! <= 0
          @locks.delete key
        end
      end
    end
  end

  private def lock
    @lock.synchronize { yield }
  end

  # A lock that exposes how many fibers are waiting on it.
  private class Lock
    @waiting = Atomic(Int32).new(0)
    @mutex = Mutex.new

    def waiting!
      @waiting.add(1) + 1
    end

    def synchronize
      @mutex.synchronize { yield }
    end

    def done!
      @waiting.sub(1) - 1
    end
  end
end
```
Locks have key-level granularity, so when you need to lock a key, it takes an administrative lock to see if there is a lock already allocated for that key. If not, it allocates one and assigns it, then releases the administrative lock. Then, when no more fibers are attempting to acquire that lock, it deletes it (or returns it to the pool, but I haven’t implemented that yet).
The idea is to wrap this around commands like INCR so that if the key doesn’t exist it will create it, basically doing what the LockPool does for its own keys. Simplified code for performing the increment:
```crystal
lock_pool.lock key do
  integer = data.fetch(key) { Integer.new(0_i64) }
  integer.increment
end
```
Putting on my SRE hat here: use care when throwing around words like “the culprit” or “faulty logic”. Failure in software rarely, if ever, has a single root cause. There is instead a combination of factors — the “root cause” won’t result in failure if certain other factors aren’t there. With this in mind, I want to be clear that I did not mean to imply that lack of PAUSE was the root cause, only suggest that it seems like a necessary component of the overall concurrency implementation in Crystal when multithreading is in play.
I think I see a problem. Suppose you have three simultaneous threads at positions A, B, and C, respectively:
```crystal
def lock(key : String)
  fiber = Fiber.current
  # C
  lock = lock do
    @locks.fetch(key) do
      @locks[key] = Lock.new
    end
  end
  # B
  begin
    lock.waiting!
    lock.synchronize { yield }
  ensure
    lock do
      # A
      if lock.done! <= 0
        @locks.delete key
      end
    end
  end
end
```
The current lock count is 1, and all three threads use the same key.
B has just exited the administrative lock, and A has just acquired it. A happens to execute first and goes on to delete the key from @locks. B then proceeds with the original lock, while C allocates a completely new one for the same key. C then acquires the newly created lock and enters the locked section at the same time as B.
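The interleaving above can be replayed deterministically by driving the three steps by hand instead of racing real threads. This is a Ruby sketch (the Crystal logic translates directly), with plain `Object` instances standing in for the `Lock` class and the waiting counter elided, since B never gets to increment it before A’s delete:

```ruby
# Hand-driven replay of the A/B/C interleaving: B fetches the lock, A deletes
# the key, C fetches again and gets a brand-new lock for the same key.
locks = {}         # stand-in for the pool's @locks hash
admin = Mutex.new  # stand-in for the administrative @lock

fetch_lock = ->(key) {
  admin.synchronize { locks[key] ||= Object.new }
}

key = "counter"

# B: looks up (and creates) the lock, but hasn't registered as waiting yet.
lock_b = fetch_lock.call(key)

# A: finishes its critical section, sees no registered waiters, deletes the key.
admin.synchronize { locks.delete(key) }

# C: looks up the key and allocates a completely new lock.
lock_c = fetch_lock.call(key)

# B and C now hold different mutexes guarding the same key, so both can
# enter the "protected" section at once.
puts lock_b.equal?(lock_c) # => false
```

Because each step is sequenced explicitly, the broken outcome reproduces on every run, unlike the original race, which needed hundreds of fibers to surface.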
Ahh, nice catch! For the same reason I put lock.done! inside the administrative lock, it looks like I need to do the same with lock.waiting!. I’ll give that a shot today. Thanks!
Wild that this never showed up on my Intel Mac, even with --release.
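For reference, a sketch of that fix, written in Ruby for illustration (the Crystal version is structurally identical). The class and method names mirror the original code, but the bodies are a hypothetical translation, not the actual patch: the key change is that `waiting!` is now called before the administrative lock is released, so a releasing fiber can never observe a stale count and delete a lock another fiber is about to use.

```ruby
# Key-granular lock pool with the registration race fixed: both the waiter
# count increment and decrement happen under the administrative lock.
class LockPool
  class Lock
    def initialize
      @waiting = 0
      @mutex = Mutex.new
    end

    def waiting!; @waiting += 1; end  # only called while holding the admin lock
    def done!;    @waiting -= 1; end  # only called while holding the admin lock

    def synchronize(&block)
      @mutex.synchronize(&block)
    end
  end

  def initialize
    @locks = {}
    @admin = Mutex.new
  end

  def lock(key)
    lock = @admin.synchronize do
      l = (@locks[key] ||= Lock.new)
      l.waiting!  # registered before the admin lock is released
      l
    end

    begin
      lock.synchronize { yield }
    ensure
      @admin.synchronize do
        @locks.delete(key) if lock.done! <= 0
      end
    end
  end
end
```

With this ordering, A’s `done! <= 0` check in the walkthrough above would see B already registered and leave the lock in place, so C would fetch the same lock instead of allocating a fresh one.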
Glad it’s been fixed. Either way, this could still be a relevant performance optimization for ARM. If possible, consider contributing your YIELD implementation to the stdlib :)