The Crystal Programming Language Forum

Multithreaded Crystal initial thoughts

To do that, run the program with an env var like CRYSTAL_WORKERS=8
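For context, the env var alone isn't enough in 0.31: multithreading is behind a compile-time flag, so you need both the build flag and the runtime variable. A minimal sketch (`server.cr` is a placeholder filename):

```shell
# Build with the multithreading preview enabled (required for
# CRYSTAL_WORKERS to have any effect), then choose the number of
# worker threads at runtime; the default is 4.
crystal build -Dpreview_mt --release server.cr
CRYSTAL_WORKERS=8 ./server
```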

Tuning the thread pool was the first thing I tried. :-)

I was using CRYSTAL_WORKERS=6 for these benchmarks, which is what allowed it to use up to 450-475% CPU. With the raw-string responses, I was only using ~380% CPU and still handling 170k reqs/sec — the Go code used 450% to do the same work. kernel_task was hitting 100% CPU on its own handling the I/O, so I don’t think there’s a way I can get > 170k while running wrk (which was at 180% CPU) and the Crystal server on the same macOS box.

I tried on another beefy DigitalOcean droplet because that's a closer environment to what people will actually run web servers on. I couldn't get Crystal to compile on it for some reason, and I don't have time to look into it before work, so I'll have to try again over the weekend.


It would be interesting to compare to Go with fasthttp instead of the standard net/http.

Go with fasthttp can be much faster.

The main reason seems to be fewer memory allocations.

But boy, I prefer Crystal.

Feel free to try it out if you’re interested! You clearly know more about Go than I do. :smile:

My own preference is to look at apps performing realistic workloads, which is why I serialized JSON to begin with. I'd like to see it do a more realistic amount of work within the request, tbh (I'm not interested in how fast I can make an app do nothing :joy:), like talking to a DB, cache, etc. I'm just having trouble getting that working right now, so that'll have to wait a bit longer.


like talking to a DB

Just note that crystal-db isn’t prepared for multithreading right now, but will soon be. So if you want to write a benchmark using that you’ll most likely get crashes or similar.


With the web request benchmark, is it using 100% of the CPUs?

yes

It would be interesting to see the output from a sampling CPU profiler for those 100%-CPU runs, i.e. "where is it using all that extra CPU?", since theoretically there are 8 cores but it's only 4x as fast… :) Basically just out of curiosity…

Running a profiler on the original app I wrote for this thread (the one that yields the fiber and then serializes JSON) shows the event loop needs 12% of the CPU time available to that thread (1.55 seconds out of 12.82):

Most of that is libevent (1.46 / 1.55s, or 94.2% of CPU time), so we can’t optimize that any further within Crystal:

Scheduling is ~64ms out of that 1.55s (4%) so any optimization there will yield negligible results.

In a larger app, the event loop is significantly less of a concern. Here is the profile of a Reddit-like app I wrote a while back to test out the neo4j shard:

This hits a Neo4j database and outputs HTML, so it’s a reasonably realistic workload. The event loop used 399ms out of 5.98 seconds (6%). Fiber scheduling was 28ms — 0.5% of that fiber’s 6 CPU seconds. The rest of the event loop was all libevent.

Note: all of these traces contain 6 threads, but I only showed the heaviest one because they’re basically all the same


Now that Crystal 0.31.0 has been released, it's probably a good idea to re-run the benchmarks.

I've tried a hello-world HTTP server on macOS with Crystal 0.31.0, and the max I can get is about 48K req/s in single-threaded mode and 84K req/s in multithreaded mode with CRYSTAL_WORKERS=3 (other worker counts perform worse for me).

That's about 1.8x the throughput, which is less than in the first article https://crystal-lang.org/2019/09/06/parallelism-in-crystal.html where it goes from about 48K to about 120K for 3 workers, as shown in the charts. That's about 2.5x the throughput.

UPDATE: The chart at https://crystal-lang.org/2019/09/06/parallelism-in-crystal.html shows ~120K for CRYSTAL_WORKERS=4, but still, even CRYSTAL_WORKERS=2 seems to give better results than CRYSTAL_WORKERS=3 in my case.
I assume this has to do with either commits that landed after the original article or with my setup.

My env:

$ crystal --version
Crystal 0.31.0 (2019-09-24)

LLVM: 8.0.1
Default target: x86_64-apple-macosx

I've used the HTTP server from Crystal's home page:

require "http/server"

server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello world, got #{context.request.path}!"
end

puts "Listening on http://127.0.0.1:8080"
server.listen(8080)

With

$ wrk -d10s -t2 -c128 --latency http://localhost:8080/foo

I did re-run the benchmarks after https://github.com/crystal-lang/crystal/commit/a0132d5bb328d232b3e9decf0c4059b26280b7e6 and I didn't notice any noticeable difference.

I guess the difference is in the setup.

@bcardiff Did you run the benchmark after turning overflow checks on? That might affect performance.

Can this one degrade HTTP server performance? https://github.com/crystal-lang/crystal/pull/8168

I pulled up the code from my original post and ran it on the same machine on v0.31.0 released through Homebrew and got 108518 reqs/sec (90.68µs avg latency) at 450-470% CPU usage. This lines up with my original findings of 108131 reqs/sec (93.54µs avg latency) at 458% CPU usage.

@vlazar You may want to try wrk with -c10. Having 128 simultaneous connections to a web process that serves its response in microseconds is equivalent to serving millions of requests per second from a single process and nobody’s actually doing that. :-) With something that responds that quickly, the odds are you’ll never have more than a few simultaneous connections, so the default -c10 is a more realistic workload.
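Concretely, that suggestion applied to the wrk invocation from earlier in the thread (same duration, threads, and URL, just fewer connections):

```shell
# Same benchmark, but with 10 simultaneous connections instead of 128
wrk -d10s -t2 -c10 --latency http://localhost:8080/foo
```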

Alternatively, you can try giving it a realistic workload to make the 128 simultaneous connections more realistic but, as @asterite mentioned earlier in the thread, existing DB connection pools may not be threadsafe.


as @asterite mentioned earlier in the thread, existing DB connection pools may not be threadsafe

Now that 0.31.0 is out I think crystal-db is thread-safe. That’s also been the work of @bcardiff in the past release, as far as I know.


Ah, nice! I saw the 0.7.0 release but I didn’t notice anything related to thread safety in it. It’s also possible I just didn’t know what to look for in it since I’m only using Crystal with Neo4j atm.

The main thing solved in crystal-db 0.7.0 was the locking of checkout/release of connections, which could generate peaks above the expected limit.

There could be different configurations that might be better and haven't been stress-tested: is it better to have a single pool for the whole process, or a pool per worker? If the latter, the workers would not need to synchronize to check out and release connections. But that might not be applicable in light of fiber stealing, which is not present today but hasn't been settled either.
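To make the trade-off concrete, here is a minimal Crystal sketch. This is illustrative only, not crystal-db's actual API; `SharedPool` is a made-up name:

```crystal
# Illustrative sketch only: with one pool for the whole process,
# every worker thread contends on the same lock for every
# checkout/release.
class SharedPool(T)
  def initialize(@items : Array(T))
    @lock = Mutex.new
  end

  def checkout : T
    @lock.synchronize { @items.pop } # raises IndexError if empty
  end

  def release(item : T) : Nil
    @lock.synchronize { @items << item }
  end
end

# A pool-per-worker design would give each worker thread its own
# pool and could drop the Mutex entirely, but only as long as fibers
# never migrate between threads (no fiber stealing), which is exactly
# the open question mentioned above.
```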


A recent benchmark for the HTTP server looks even slightly better.


Interesting. When I profiled it (single-threaded), it seemed the majority of the time was spent in #to_json, with a bunch of that being the time to convert floats to strings, and a lot in reading /dev/urandom for UUID.random. FWIW… :)