"signals delivery fails constantly" in multi-threaded Crystal app

Hi Crystal folks – I have a small Crystal 1.8.0 app (on Ubuntu 20.04) that serves some vector tile map data using http/server, the pg shard and crystal-sqlite3. Encoding the vector tile data is somewhat CPU intensive, so I have this running on a 64 core machine with -Dpreview_mt=true and CRYSTAL_WORKERS set to 40.

This generally works pretty well, but, being used in mapping applications, traffic can be quite bursty: users are likely to make a rapid series of requests all within a short period of time as they pan/zoom around the map.

Every now and then the app freezes for a while and then crashes with the error:

Signals delivery fails constantly at GC #1644
Signals delivery fails constantly
Aborted

I run the app in a while true; do ... loop to restart it when this happens, but is there a better way to handle it, or to prevent it? From what I can tell, the “Signals delivery fails constantly” error comes from bdwgc (pthread_stop_world.c at commit 9229da044bbc5f5f131741975c0c35522bed227d in ivmai/bdwgc on GitHub), but what to actually do about it is a bit over my head. There does seem to be a GC_RETRY_SIGNALS environment variable that I can alter to affect how many times (if at all) lost signals are re-sent, but I really have no idea what’s going on here.
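For reference, the restart loop mentioned above might look roughly like the sketch below. It is only an illustration and does nothing about the underlying GC problem: the function name run_with_restart, the restart cap and the delay are all hypothetical, and ./tile_server in the usage note stands in for the real binary.

```shell
#!/bin/sh
# Hypothetical supervisor loop (not from the original app): rerun a command
# whenever it exits non-zero, up to a maximum number of restarts.
# Usage: run_with_restart MAX_RESTARTS DELAY_SECONDS COMMAND [ARGS...]
run_with_restart() {
    max_restarts=$1
    delay=$2
    shift 2
    restarts=0
    while true; do
        status=0
        "$@" || status=$?            # run the command, capture its exit status
        if [ "$status" -eq 0 ]; then
            return 0                 # clean exit: stop supervising
        fi
        echo "command exited with status $status (restarts so far: $restarts)" >&2
        if [ "$restarts" -ge "$max_restarts" ]; then
            return "$status"         # give up and report the last failure
        fi
        restarts=$((restarts + 1))
        sleep "$delay"               # brief pause so a crash loop does not spin
    done
}
```

Something like `run_with_restart 100 1 ./tile_server` would then replace the bare while true loop. Capping the restarts and logging each exit status makes a crash loop visible instead of silently spinning. Exporting GC_RETRY_SIGNALS before the call would be the place to experiment with that setting, though how bdwgc interprets its value is something to check in the linked source first.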

Any ideas about what I might consider?

Sounds like an issue in bdwgc itself. See “Signals delivery fails in gctest on Ubuntu Jammy if compiled with TSan” (Issue #543 in ivmai/bdwgc on GitHub); you can try building bdwgc manually and running gctest to see whether you hit the same failure.

HTH

oh good idea – thanks!