In my app I have an HTTP server fiber and a worker fiber.
I observed that under heavy worker load (reading/parsing/writing files in a loop), the HTTP server is not responding. Right after the worker fiber finishes its job, the HTTP server responds.
I'm aware of the possibility to pass control back to the fiber scheduler with Fiber.yield.
I inserted Fiber.yield in the worker loop, but the situation didn't change: the HTTP server still responded only after the worker job.
How does the fiber scheduler decide which fiber is going to run next?
Do you have some example code? Otherwise, my best guess at the moment is that the worker fiber is doing something that prevents execution from switching back to the main fiber that is running the HTTP server. That's a bit surprising if you say it's doing file operations, as normally I would have thought the scheduler would run other fibers while waiting on I/O when reading from the files. But it's hard to say for sure without some code to look at.
start_channel = Channel(Nil).new(1)

spawn name: "http_server" do
  HTTP::Server.new do |context|
    case context.request.path
    when /^start$/
      start_channel.send nil
      context.response.status = HTTP::Status::OK
      context.response.content_type = "text/json"
      context.response << data.to_json
    else
      context.response.respond_with_status HTTP::Status::NOT_FOUND
    end
  end
end

spawn name: "worker" do
  loop do
    select
    when start_channel.receive
      io = IO::Memory.new
      get_big_file_list.each do |file|
        Compress::Zip::Writer.open(io) do |writer|
          writer.add_file File.basename(file), File.read(file)
        end
        ...
      end
      ...
    end
  end
end

sleep
Sounds like a case of not all I/O operations being the same. File I/O isn't blocking in the same way that socket I/O is. Blocking occurs when there isn't data ready yet, and this is never the case for local files. Round trips to disk are often measured in microseconds (or less, when served from the page cache), so the CPU is never yielded.
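To make this concrete, here's a minimal sketch (fiber names and the 3-iteration counts are made up for illustration) of two fibers on a single thread that interleave only because each one calls Fiber.yield inside its loop:

```crystal
results = [] of String

spawn name: "ticker" do
  3.times do
    results << "tick"   # stand-in for serving an HTTP request
    Fiber.yield         # reschedule point: let other fibers run
  end
end

spawn name: "cruncher" do
  3.times do
    results << "crunch" # stand-in for reading/parsing/writing a file
    Fiber.yield         # without this, the loop hogs the CPU
  end
end

sleep 10.milliseconds   # yield the main fiber so the others can finish
```

Remove the Fiber.yield calls and each fiber runs its whole loop to completion before the other one ever gets the CPU, which is exactly the starvation pattern described above.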
This can even be the case for sockets when data is coming in over the wire as fast as or faster than you're processing it. Since reading from a TCPSocket is really reading from a buffer in memory, if every TCPSocket#read requests less data than you currently have in the buffer (keeping in mind that the kernel also has its own socket buffers in addition to Crystal's IO::Buffered ones), you're never actually waiting on the socket. So in some scenarios you never yield the CPU during socket I/O, either.
Parallelism would mostly improve throughput of your CPU-bound worker tasks. That might be meaningful if you need to run multiple workloads in parallel.
But multithreading is absolutely not necessary to have a snappy response from the HTTP server. The server and worker fibers should be perfectly able to coordinate sharing CPU time under single-threaded concurrency.
If there are no reschedule points in a long-running task, that may require some strategically placed Fiber.yield calls to implement cooperative sharing.
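As a sketch of what "strategically placed" could look like for the zip workload above (the zip_files helper and its paths parameter are made up for illustration, and this uses Compress::Zip::Writer#add): open the writer once, then yield after each file so the HTTP server fiber gets a turn between files:

```crystal
require "compress/zip"

# Hypothetical helper: zip a list of files into an in-memory buffer,
# handing control back to the scheduler after each file is added.
def zip_files(paths : Array(String)) : IO::Memory
  io = IO::Memory.new
  Compress::Zip::Writer.open(io) do |writer|
    paths.each do |path|
      writer.add File.basename(path), File.read(path)
      Fiber.yield # reschedule point between files
    end
  end
  io.rewind
  io
end
```

Note this also opens the writer once for the whole list, rather than reopening it inside the loop for every file as in the snippet above.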
You mentioned that you did already try that. Where did you put Fiber.yield in that example?
Fiber.yield will ensure that your other fibers get a chance to run — it literally pushes the current fiber onto the end of the scheduler’s queue. Since we don’t see all of the code in your worker fiber, and since you’re putting explicit Fiber.yield calls into the loop, it sounds like what’s hogging the CPU might be elsewhere.
If your first thought was that you were getting stuck on file I/O, it sounds like you’re shoveling a lot of data into that IO::Memory instance. That would probably consume a lot of RAM, so it might even be worth checking whether you’re dipping into swap.
Is it possible it’s getting past that code and the lines you’ve omitted are what’s actually CPU-bound?
Are you doing anything with the IO::Memory buffer after all the files have been shoveled into it that could be CPU-bound?
Does that select statement have an else in the actual code? If so, it won't block at all and will basically just be executing loop { }, which will definitely not let your HTTP server work until it comes upon something that yields the CPU. It might be worth sticking a Fiber.yield just inside the loop block as well. Or, if the select does contain an else clause, replace it with when timeout(1.second) to long-poll the start_channel.
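For illustration, here's a hedged sketch of the difference (the channel and the timing values are made up): with when timeout, the waiting fiber blocks in the event loop between polls instead of spinning, so other fibers keep running:

```crystal
start_channel = Channel(Nil).new(1)

spawn do
  sleep 5.milliseconds
  start_channel.send nil # simulate the /start request arriving later
end

started = false
until started
  select
  when start_channel.receive
    started = true
  when timeout(1.millisecond)
    # No message yet: this branch blocks the fiber for up to 1ms and
    # yields the CPU, unlike an `else` branch, which returns
    # immediately and turns the loop into a busy-wait.
  end
end
```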