Hello!
I have this simple test program that fetches URLs in parallel using fibers:
require "http"
NUM_FIBERS = 10
urls = [
  "https://www.apple.com/",
  "https://www.google.com/",
  "https://www.ibm.com/",
  "https://www.oracle.com/",
  "https://www.intel.com/",
  "https://www.sap.com/",
  "https://www.nytimes.com/",
  "https://cnn.com/",
  "https://www.nasa.gov/",
  "https://www.spacex.com/",
]
urls = urls * (NUM_FIBERS / urls.size).to_i32 # just make more urls...
puts "Getting #{urls.size} URLs..."
results_channel = Channel(HTTP::Client::Response | Exception).new # urls.size
urls.each do |url|
  spawn do
    begin
      response = HTTP::Client.get url
      results_channel.send response
    rescue ex
      results_channel.send ex
    end
  end
end
puts "Waiting for results channel..."
urls.size.times { results_channel.receive }
puts "Done"
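For reference, the channel above is unbuffered, so each `send` blocks until the main fiber's matching `receive`; the commented-out capacity hint (`# urls.size`) would make it buffered, letting sends complete immediately while space remains. A minimal sketch of the difference, using only the standard `Channel` API:

```crystal
# Unbuffered: send blocks until another fiber receives.
unbuffered = Channel(Int32).new

# Buffered with capacity 2: send completes immediately while space remains.
buffered = Channel(Int32).new(2)

spawn do
  buffered.send 1   # returns right away
  buffered.send 2   # returns right away
  unbuffered.send 3 # blocks here until the receive below
end

puts buffered.receive   # => 1
puts buffered.receive   # => 2
puts unbuffered.receive # => 3
```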
When the number of fibers is low, for example 10 or 100, everything is OK, but with a larger number of fibers, for example 200, 300 or 1000:
NUM_FIBERS = 300
...
the program never ends and gets stuck after printing:
Getting 300 URLs...
Waiting for results channel...
Tried on Ubuntu 23.10 and Alpine 3.19 (via Docker); the behaviour is the same.
➤ crystal --version
Crystal 1.11.0 [95d04fab4] (2024-01-08)
LLVM: 15.0.7
Default target: x86_64-unknown-linux-gnu
Where is the problem please? Thanks!
Can’t reproduce on an M2 with 1000 fibers (took 11 seconds).
I can reproduce. It almost finishes: in my trial, 297 fibers completed, then it blocked waiting for the remaining 3.
I retested it, replacing the real-world URLs with my own simple Kemal web app on localhost.
Now I am able to run this test with 10 000 URLs/fibers without a problem (no hang, pretty fast).
The problem with real-world URLs is probably that some servers detect excessive requests and do something about it (they might delay the response, plus network stuff like that, or my internet connection + ISP is a piece of…).
I realized that HTTP::Client has no default timeouts (meaning the client waits forever, right?), so I added timeouts to the HTTP::Client instance and voilà: I can now run the test with 10 000 URLs/fibers (just out of curiosity) and it's stable (memory usage was 6.5 GB RES).
So:
require "http"
NUM_FIBERS = 1000
TIMEOUT_SEC = 5
urls = [
  "https://www.apple.com/aaa?a=1",
  "https://www.google.com/aaa",
  "https://www.ibm.com/",
  "https://www.oracle.com/",
  "https://www.intel.com/",
  "https://www.sap.com/",
  "https://www.nytimes.com/",
  "https://cnn.com/",
  "https://www.nasa.gov/",
  "https://www.spacex.com/",
]
urls = urls * (NUM_FIBERS / urls.size).to_i32
puts "Getting #{urls.size} URLs..."
results_channel = Channel(HTTP::Client::Response | Exception).new # urls.size
urls.each do |url|
  spawn do
    begin
      HTTP::Client.new(URI.parse url) do |cli|
        cli.connect_timeout = cli.dns_timeout = cli.read_timeout = cli.write_timeout = TIMEOUT_SEC
        response = cli.get URI.parse(url).request_target
        results_channel.send response
      end
    rescue ex
      results_channel.send ex
    end
  end
end
puts "Waiting for results channel..."
urls.size.times { results_channel.receive }
puts "Done"
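Timeouts solve the hang; another option, if memory usage matters, is to cap the number of in-flight requests with a fixed pool of worker fibers fed from a channel, instead of one fiber per URL. A sketch along those lines (the pool size, the `jobs` channel, and the example.com URLs are illustrative, not from the thread):

```crystal
require "http"

POOL_SIZE   = 32 # illustrative cap on concurrent connections
TIMEOUT_SEC = 5

urls = Array.new(1000) { |i| "https://example.com/?i=#{i}" } # placeholder URLs

jobs    = Channel(String).new
results = Channel(HTTP::Client::Response | Exception).new

POOL_SIZE.times do
  spawn do
    # receive? returns nil once the channel is closed and drained,
    # so each worker exits cleanly when the work runs out.
    while url = jobs.receive?
      begin
        HTTP::Client.new(URI.parse url) do |cli|
          cli.connect_timeout = cli.read_timeout = cli.write_timeout = TIMEOUT_SEC
          results.send cli.get(URI.parse(url).request_target)
        end
      rescue ex
        results.send ex
      end
    end
  end
end

spawn do
  urls.each { |url| jobs.send url }
  jobs.close
end

urls.size.times { results.receive }
puts "Done"
```

With this shape, memory and socket usage scale with `POOL_SIZE` rather than with the number of URLs, at the cost of less parallelism.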