Concurrent HTTP request performance

I’m trying to fetch a large set of records from an HTTP API concurrently and have structured my code like this:

require "http/client"
require "json"

fetch_queue = Channel(Int32).new
results_queue = Channel(JSON::Any).new

def fetch(url : String)
  res = HTTP::Client.get(url)
  JSON.parse(res.body)
end

spawn do
  (1..10000000).each do |id|
    fetch_queue.send(id)
  end
end

start_time = Time.utc
count = 0

spawn do
  loop do
    res = results_queue.receive

    count += 1
    if count % 1000 == 0
      elapsed = Time.utc - start_time
      seconds = elapsed.total_seconds
      rate = count / seconds
      puts "#{count} (#{elapsed}) @ #{rate.round}/sec"
    end
  end
end

(1..128).each do |i|
  spawn do
    loop do
      id = fetch_queue.receive
      res = fetch("https://example.com/api/items/#{id}/")
      results_queue.send(res)
    end
  end
end

sleep

I’m getting around 200 results per second at the moment. A similar implementation in Node gives me around 1600 results per second (while also persisting them to disk). Am I doing something wrong in the Crystal code above? I was expecting significantly higher throughput. I even tried compiling with crystal build --release -Dpreview_mt to enable multiple threads, but that didn’t seem to help much (probably because this workload is so IO-bound).

Any suggestions?

Thanks!

  • Increase the fiber count?
  • Maybe the Node HTTP client shares the connection, or uses a connection pool?

This doesn’t seem to help. I’ve tried spawning 1000 fibers on the fetching side and I still hover around 200 requests/sec.

Maybe. I’m using the built-in HTTP::Client.get in my fetch function; I’ll need to look into the docs on that.

That seems very likely. I would certainly expect that.

In Crystal, HTTP::Client.get is a one-off request that establishes a new connection every time. That’s a huge overhead when you’re hitting the same host again and again.

You could easily avoid that by initializing a dedicated HTTP::Client instance for every worker fiber. Requests within a fiber can then reuse the same connection.
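A minimal sketch of that idea, made self-contained by running the fetches against a local HTTP::Server instead of the real API (the /api/items/<id>/ path and fiber/channel names are carried over from the original post; the fiber and request counts are shrunk just for illustration):

```crystal
require "http/client"
require "http/server"
require "json"

# Local stand-in for the remote API so the example runs without network access.
server = HTTP::Server.new do |ctx|
  id = ctx.request.path.split('/')[-2]? # "/api/items/<id>/" -> "<id>"
  ctx.response.content_type = "application/json"
  ctx.response.print({"id" => id}.to_json)
end
address = server.bind_tcp "127.0.0.1", 0 # port 0 picks a free port
spawn { server.listen }

fetch_queue   = Channel(Int32).new
results_queue = Channel(JSON::Any).new

4.times do
  spawn do
    # One HTTP::Client per fiber: the connection is established once and then
    # reused for every request. HTTP::Client is not fiber-safe, so never share
    # a single instance between fibers.
    client = HTTP::Client.new(address.address, address.port)
    loop do
      id  = fetch_queue.receive
      res = client.get("/api/items/#{id}/")
      results_queue.send(JSON.parse(res.body))
    end
  end
end

spawn { (1..20).each { |id| fetch_queue.send(id) } }

20.times { results_queue.receive }
puts "fetched 20 results over 4 reused connections"
```

Against a real HTTPS endpoint you’d construct the client as HTTP::Client.new("example.com", tls: true) and issue relative-path gets the same way.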

Thanks so much! I’m reusing a connection per fiber now and getting ~1900 results/second. Amazing.


Yes, which is why the suggestion was for an HTTP::Client per fiber, versus sharing a single or pool of clients between fibers.


What could be used for pooling? Repurpose the existing DB::Pool from the crystal-db project (GitHub - crystal-lang/crystal-db: Common db api for crystal)? I read somewhere that it can be used for this…

Pooling doesn’t really make much sense in this use case, because all workers are constantly communicating with the same endpoint. Each worker just gets its own client and connection; there’s no need for pooling overhead.

(In case this is a generic question about pooling HTTP connections, please start a new thread for that)