Counting fibers

Problem statement

We are dealing with fairly complex application spawning fibers on demand with a fork-join pattern.
We’d expect that newly created fibers run their task and then shut down.
We’d find it useful to look at the fiber count over time under test load to make sure that the number of fibers is not growing unbounded. Spotting such pattern would allow us to investigate the issue in a timely fashion, rather than having to wait for an Out Of Memory error to hit production.

Partial solutions

My first take on this was

def fiber_count
  i = 0
  Fiber.unsafe_each { i += 1}
  i
end

Now, mind that we are relying on an undocumented bit of the Fiber API, but we’re aware of that, and understand the risk of this changing in future versions of Crystal. Also, we’re not modifying the content of the collection underlying Fiber.unsafe_each, so we should not be breaking anything.

AFAIK, this does the trick on a single threaded Crystal application.

@j8r raised that this does not produce the total number of fibers in multi-threaded applications. This is because as of Crystal 0.35.1, fibers are assigned to a thread and Fiber.unsafe_each will only iterate over the fibers assigned to a single thread (which one? The one the current fiber is assigned to).

What does this mean? It means we’d need to iterate over all the threads via Threads.unsafe_each and then figure out a way of listing the fibers assigned to each. Now, mind that at this point we’re dealing with a piece of code that is

  • hidden away from users by design, i.e. no docs
  • OS dependent

Even ignoring the above, the Thread API doesn’t give a straightforward thread-safe way to enumerate its fibers - which means we could try to collect the fibers assigned to a thread, but we might miss some in the process due to race condition. Or, to put it in the words of @yxhuvud

[…] it is not a thread safe interface - not only is the way you list them inherently racy, there is also that runnables is a normal deque and unsafe - ie it not intended to be accessed from outside the scheduler.

Closing considerations

Even with a hacker hat on, it seems unlikely that we can get a proper fiber count out of a Crystal app when the preview_mt flag is enabled. This is not something we didn’t know before - we’ve seen posts on this sort of things before - but a good reminder that to be able to monitor resources like fibers in complex, multithreaded applications would be a big win for some of us, and hopefully something to be considered for inclusions in v1.0.0.

3 Likes

To be clear, what my comment refer to is trying to access thread.scheduler.@runnables, or in other words, trying to get a list of fiber assignment per core. That is not possible now, as far as I know. Just listing threads in the system should work as far as I know, using Fiber.each.

I tried below code and it looks like printing fibers from all threads. I am not sure how to confirm it actually spawned in multiple threads.

(1..4).each do |i|
  spawn name: "f#{i}", same_thread: false do
    loop do
      sleep 2.seconds
    end
  end
end


Fiber.unsafe_each do |fiber|
  puts fiber
end

Output is

$ crystal run -Dpreview_mt fibers_count.cr
#<Fiber:0x106894f20: main>
#<Fiber:0x106894e70: Fiber Clean Loop>
#<Fiber:0x106894dc0: Signal Loop>
#<Fiber:0x106894d10: Worker Loop>
#<Fiber:0x106894c60: main>
#<Fiber:0x106894bb0: main>
#<Fiber:0x106894b00: main>
#<Fiber:0x106894a50: f1>
#<Fiber:0x1068949a0: f2>
#<Fiber:0x1068948f0: f3>
#<Fiber:0x106894840: f4>

Can you have the fiber send some data when it starts and when it ends?
Basically can you keep take of how many fibers are active by recording it in anther place?

You might be able to monkeypatch a class variable and override Fiber#run to increment and decrement it. I haven’t tested it but I’m thinking something along these lines:

class Fiber
  @@active_count = Atomic.new(0)

  # Return the total number of fibers that have begun execution but 
  # have not completed yet.
  def self.active_count
    @@active_count.get
  end

  def run
    @@active_count.add 1
    previous_def
  ensure
    @@active_count.sub 1
  end
end

This makes reading the fiber count work in O(1) time and ensures thread safety on the increment and decrement. I’ve benchmarked Atomic before and, while I don’t have the numbers handy, it was a lot closer in performance to unsafe math than it was to mutex-wrapped math, so the overhead for this should be lower than spawning the fiber in most cases.

I relate to the need to check if there is fiber starvation. But assuming there is a counting fibers mechanism, what is the next step?

How would you like to diagnose where those fibers are been created?

I understand why fibers are special because of the runtime, but isn’t this an instance of counting object allocations and wanting to trace where those allocations happen?

Something we have envision is to integrate DTrace to offer support for this kind of stories, probably with opt-in/opt-out to avoid affecting performance. Knowing name, and line:no of the spawn might help to understand the system behaviour.

Using DTrace might not cover how to monitor running apps, the Atomic counter suggested @jgaskins would be approximately my suggestion. Note that that solution is not counting the main fiber per worker (they don’t go through the run method. But again, it will not help to understand why it’s happening in case numbers go up.

1 Like

If one creates fibers with text, with spawn "some text", the source can be tracked.
Of course, especially in the stdlib, unnamed fibers are often used (maybe could it change?). In this case, indeed, it won’t be easy to know where they has been created.