Problem statement
We are dealing with a fairly complex application that spawns fibers on demand with a fork-join pattern.
We’d expect that newly created fibers run their task and then shut down.
We’d find it useful to look at the fiber count over time under test load to make sure that the number of fibers is not growing unbounded. Spotting such a pattern would allow us to investigate the issue in a timely fashion, rather than having to wait for an Out Of Memory error to hit production.
Partial solutions
My first take on this was
def fiber_count
  i = 0
  Fiber.unsafe_each { i += 1 }
  i
end
Now, mind that we are relying on an undocumented bit of the Fiber API, but we’re aware of that, and understand the risk of this changing in future versions of Crystal. Also, we’re not modifying the content of the collection underlying Fiber.unsafe_each, so we should not be breaking anything.
AFAIK, this does the trick on a single-threaded Crystal application.
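To make the "count before and after a load" idea concrete, here is a minimal sketch for the single-threaded case. The fork_join helper is hypothetical and only stands in for the real application's workload; the only Crystal internals it relies on are the same undocumented Fiber.unsafe_each used above, so treat it as an illustration rather than a supported technique.

```crystal
# Sketch only: Fiber.unsafe_each is undocumented and may change between Crystal versions.
def fiber_count : Int32
  count = 0
  Fiber.unsafe_each { count += 1 }
  count
end

# Hypothetical stand-in for the application's workload:
# spawns `jobs` short-lived fibers and waits for all of them (fork-join).
def fork_join(jobs : Int32) : Nil
  done = Channel(Nil).new
  jobs.times do
    spawn do
      Fiber.yield      # pretend to do some work
      done.send(nil)   # signal completion to the joining fiber
    end
  end
  jobs.times { done.receive }
end

before = fiber_count
fork_join(100)
Fiber.yield # give finished fibers a chance to run to completion
after = fiber_count

puts "fibers before: #{before}, after: #{after}"
STDERR.puts "possible fiber leak" if after > before
```

In a real test run one would probably sample fiber_count periodically instead of just twice, and compare the trend rather than exact values, since the runtime may start fibers of its own.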
@j8r raised that this does not produce the total number of fibers in multi-threaded applications. This is because, as of Crystal 0.35.1, fibers are assigned to a thread and Fiber.unsafe_each will only iterate over the fibers assigned to a single thread (which one? The one the current fiber is assigned to).
What does this mean? It means we’d need to iterate over all the threads via Thread.unsafe_each and then figure out a way of listing the fibers assigned to each. Now, mind that at this point we’re dealing with a piece of code that is:
- hidden away from users by design, i.e. no docs
- OS dependent
Even ignoring the above, the Thread API doesn’t give a straightforward thread-safe way to enumerate its fibers - which means we could try to collect the fibers assigned to a thread, but we might miss some in the process due to race conditions. Or, to put it in the words of @yxhuvud:
[…] it is not a thread safe interface - not only is the way you list them inherently racy, there is also that runnables is a normal deque and unsafe - ie it not intended to be accessed from outside the scheduler.
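Since the runtime's own lists are off limits, one workaround is to stop asking the scheduler and count at the application level instead. The sketch below is a different technique from the one above: it only tracks fibers started through a wrapper, not every fiber in the process. The TrackedFibers module and its method names are hypothetical (nothing like it ships with Crystal); the only standard library pieces assumed are Atomic(Int32) and the top-level spawn.

```crystal
# Hypothetical module, not part of Crystal's standard library.
module TrackedFibers
  @@live = Atomic(Int32).new(0)

  # Number of fibers started through `TrackedFibers.spawn` that have not finished yet.
  def self.live : Int32
    @@live.get
  end

  # Wraps the top-level `spawn`: increments the counter before the fiber is
  # created and decrements it when the block exits, even if it raises.
  def self.spawn(&block) : Fiber
    @@live.add(1)
    ::spawn do
      begin
        block.call
      ensure
        @@live.sub(1)
      end
    end
  end
end

done = Channel(Nil).new
10.times { TrackedFibers.spawn { done.send(nil) } }
puts "live tracked fibers after fork: #{TrackedFibers.live}"

10.times { done.receive }
Fiber.yield # let the senders run to completion
puts "live tracked fibers after join: #{TrackedFibers.live}"
```

The counter is atomic, so reading it from another thread should be safe with preview_mt, but it is blind to fibers spawned by shards or by the runtime itself. It answers "are my fork-join fibers piling up?", not "how many fibers exist?".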
Closing considerations
Even with a hacker hat on, it seems unlikely that we can get a proper fiber count out of a Crystal app when the preview_mt flag is enabled. This is not news - we’ve seen posts on this sort of thing before - but it is a good reminder that being able to monitor resources like fibers in complex, multithreaded applications would be a big win for some of us, and hopefully something to be considered for inclusion in v1.0.0.