I’m experiencing an issue where crystal spec completes the Parse/Semantic and Codegen stages (bc+obj, linking, dsymutil), but then hangs and never actually runs the specs.
This issue consistently happens on macOS / Apple Silicon.
Clearing the cache (rm -rf ~/.cache/crystal) does not fix the issue; it still hangs right after the Codegen/linking stage.
CPU Usage: While hanging, both the crystal spec process and the compiled crystal-run-spec.tmp binary sit at 0.0% CPU (state S+). I think this indicates a deadlock or a blocking wait rather than an infinite loop.
You can reproduce this by cloning the OWASP Noir repository and running crystal spec.
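Steps to reproduce (repo URL is from memory, so please double-check it):

```shell
git clone https://github.com/owasp-noir/noir.git
cd noir
shards install   # fetch dependencies first
crystal spec     # hangs right after the Codegen/linking stage on Apple Silicon
```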
Environment:
macOS: Tahoe 26.3
Crystal: 1.19.1
Architecture: Apple Silicon (ARM64)
I suspect this might be a macOS-specific LLVM or spec runner issue, possibly a deadlock when the runner tries to execute the compiled spec binary. Has anyone seen this behavior before, or have any pointers on how to debug this further?
I checked out a previous commit from the commit log and ran the specs, and they passed, which indicates the problem is something in the code rather than my environment.
I used git bisect to find the offending commit, based on whether the specs would run, and it points to this commit. I can’t find anything in that commit that would cause this to happen but, sure enough, the specs run just fine at the parent commit.
The issue starts exactly at that commit, so I focused the hang analysis on the newly added specs, but I haven’t pinpointed the exact cause yet.
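For anyone following along, the bisect boils down to something like this (the good SHA is a placeholder, and `timeout` needs GNU coreutils on macOS):

```shell
git bisect start
git bisect bad HEAD              # specs hang at this commit
git bisect good <good-sha>       # placeholder: last commit where specs passed
# Automate the search; the timeout turns the hang into a failing exit code.
git bisect run timeout 300 crystal spec
git bisect reset
```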
From the sound of it, the specs start but the program gets stuck waiting on the event loop: 0% CPU, and the SIGINT signal is received and handled, which means the program isn’t blocked on a syscall, yet for some reason a fiber isn’t being resumed.
You can then run lldb (or gdb or another debugger) against owasp_spec, interrupt the program, and see where it hangs… though it probably won’t tell you much: stuck on kevent (or epoll_wait on Linux), which just means “waiting on the evloop”.
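A typical session might look like this (the binary name and PID lookup are assumptions; adjust to the actual spec binary):

```shell
# Attach to the hung binary (check `ps` for the real process name).
lldb -p "$(pgrep -f crystal-run-spec.tmp)"

# Inside lldb: dump every thread's backtrace. The main thread will likely
# show the event loop parked in kevent, confirming the "waiting on evloop" state.
(lldb) thread backtrace all
(lldb) detach
(lldb) quit
```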
The trace.log file reports everything related to the fiber scheduler and the event loop while the program executes.
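If I remember correctly, the trace is produced by compiling with the tracing flag and selecting subsystems via environment variables (flag and variable names are from memory; double-check them against the Crystal docs):

```shell
# Build with runtime tracing compiled in (-Dtracing), then trace the
# scheduler and event loop sections into trace.log while the specs run.
CRYSTAL_TRACE=sched,evloop CRYSTAL_TRACE_FILE=trace.log crystal spec -Dtracing
```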
Other ideas:
Does it reproduce in a Linux VM? It might be an issue on AArch64 rather than on macOS.
Does it reproduce with -Dpreview_mt -Dexecution_context to enable execution contexts?
With execution contexts, perf-tools can print scheduler/fiber details (and there’s pending work that could print the backtraces of non-running fibers).
The functional tests create a NoirRunner that runs tech analyzers concurrently using WaitGroup.wait (analyzer.cr:125). WaitGroup internally uses Fiber.suspend, Crystal::SpinLock, and Atomic, and on the kqueue-based event loop used on Apple Silicon, the fibers seem to never resume when these are combined.
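For context, here’s a minimal sketch of that fan-out pattern using the stdlib WaitGroup (the analyzer names are made up, not the actual Noir code):

```crystal
require "wait_group"

wg = WaitGroup.new
analyzers = ["ruby_rails", "go_echo", "python_django"]

analyzers.each do |name|
  wg.add(1)
  spawn do
    # ... run the tech analyzer for `name` here ...
    wg.done
  end
end

wg.wait # suspends the current fiber until every spawned fiber calls #done
```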
Main fiber goes straight into evloop.run (blocking=1) and stays there — stuck on kevent for ~4.8s+.
Only the stack-pool-collector wakes up occasionally; the spec runner fiber never gets scheduled again.
Trace also showed around 1052 fibers in total when it hangs (way more than when running the dirs separately).
My guess is there’s some interaction between the WaitGroup from http_proxy and the one used in the functional tests, only happens when they get compiled together on Apple Silicon (doesn’t reproduce on Linux, probably epoll vs kqueue difference).
Anyway, that’s what I found so far. Let me know what you think… happy to run more tests if it helps!
Yes: you can’t create a ::WaitGroup type because there’s already one in the stdlib (and has been for quite a few versions now). If both happen to get loaded, then either implementation is monkey-patching the other, and they conflict. Since it’s a synchronization type, a fiber is almost certain to never resume at some point.
At the very least, the custom implementation should be namespaced, as Noir::WaitGroup for example, or the global WaitGroup should only be defined for older Crystal versions.
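A sketch of what that could look like (the version cutoff for the stdlib WaitGroup is from memory; check the changelog for the exact release):

```crystal
{% if compare_versions(Crystal::VERSION, "1.12.0") >= 0 %}
  require "wait_group"
{% end %}

module Noir
  {% if compare_versions(Crystal::VERSION, "1.12.0") >= 0 %}
    # Modern compilers: reuse the stdlib type under the project namespace.
    alias WaitGroup = ::WaitGroup
  {% else %}
    # Older compilers: keep the custom fallback, but namespaced so it can
    # never collide with a stdlib ::WaitGroup.
    class WaitGroup
      # custom implementation would go here
    end
  {% end %}
end
```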
What happens if you remove your custom implementation and just use the one from stdlib? It should be a bit more optimized anyway.