Multithreaded Crystal initial thoughts

The multithreading preview for Crystal was just merged in yesterday so I’ve been playing around with it a bit to check its performance. Sure enough, I was able to saturate all 32 cores on a DigitalOcean droplet with 32.times { spawn { loop {} } }.
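A bounded variant of that one-liner might look like the sketch below. The burn helper, the iteration count, and the channel used to wait for completion are illustrative assumptions, not the code behind the result above; the original loop {} never returns, so this version gives each fiber a fixed amount of arithmetic instead.

```crystal
# Build with: crystal build -Dpreview_mt --release burn.cr
# `burn` is a hypothetical helper: it spawns `fibers` busy fibers, waits for
# all of them, and returns how many reported back.
def burn(fibers : Int32, iterations : Int64) : Int32
  done = Channel(Nil).new

  fibers.times do
    spawn do
      x = 0_u64
      # Wrapping addition keeps the loop free of overflow checks.
      iterations.times { |i| x = x &+ i.to_u64 }
      done.send(nil)
    end
  end

  finished = 0
  fibers.times do
    done.receive
    finished += 1
  end
  finished
end

# Raise the iteration count to keep the cores busy long enough to watch in top.
puts burn(32, 1_000_000_i64)
```

Without -Dpreview_mt the same program still finishes, but every fiber runs on a single thread.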

I wrote up a quick HTTP::Server app to check performance (I don’t know how to get a local build of Crystal to use shards or I’d check on a real app). Here’s the single-thread benchmark (via wrk, output trimmed):

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   199.01us   90.71us   8.53ms   97.75%
Requests/sec:  50590.09

For comparison purposes, the release version of v0.30.1 gets 50013 reqs/sec with the same code on my machine. Still in the same ballpark, but it’s really heartening to know that the changes didn’t result in worse performance in single-thread mode. In multi-thread mode with a single thread, performance did drop a bit to 47457 — approximately a 5% reduction.

With the preview_mt flag enabled (crystal run -Dpreview_mt --release on Crystal master):

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    93.54us   77.55us   3.36ms   96.54%
Requests/sec: 108131.26

This is awesome: we get more throughput from a single process! Latency is lower! In fact, this is the first time I’ve been able to get wrk to consume more than 100% CPU!

It’s not proportional, though, unfortunately. The first version of the app consumes 100% CPU and the second consumes 460%. But rather than a 4.6x improvement in throughput we only get 2.14x. Still a win, just not the one I was expecting. This isn’t intended as a criticism, just an observation that it’s probably not ready yet. :-)
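To put a number on "not proportional", the arithmetic can be sketched as follows. The scaling_efficiency helper is a hypothetical name; the inputs are the two Requests/sec figures and the 100% and 460% CPU readings quoted above.

```crystal
# Speedup divided by the growth in CPU use.
# 1.0 would mean throughput scaled in exact proportion to CPU consumed.
def scaling_efficiency(base_rps : Float64, mt_rps : Float64,
                       base_cpu : Float64, mt_cpu : Float64) : Float64
  speedup = mt_rps / base_rps
  cpu_growth = mt_cpu / base_cpu
  speedup / cpu_growth
end

speedup = 108131.26 / 50590.09
efficiency = scaling_efficiency(50590.09, 108131.26, 100.0, 460.0)

puts "speedup:    #{speedup.round(2)}x"
puts "efficiency: #{(efficiency * 100).round(1)}%"
```

By this measure each unit of CPU in the multi-threaded run delivers a bit under half the throughput it does in the single-threaded run.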

If someone can drop some tips on how to get a locally built Crystal to compile with shards I’d be happy to run this against a real app instead of poorly simulated work. :joy: I have a feeling that throughput might be a little more proportional when the request does real work since scheduling fibers will likely be a much smaller ratio of the total work being done. Getting the DB to keep up might be challenging, though.


I feel like the call to spawn should default to the same thread (maintaining backwards compatibility), whereas if you want a new thread you could do something like:

spawn new_thread: true { operation }

That’s already a thing: spawn(*, name : String? = nil, same_thread = false, &block)

However, same_thread defaults to false.
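For reference, a minimal sketch of both forms of the call, using the signature quoted above. The channel and the message strings are illustrative only.

```crystal
results = Channel(String).new

# Default (same_thread: false): under -Dpreview_mt the scheduler may place
# this fiber on any worker thread.
spawn { results.send("default") }

# Opt in to staying on the thread of the fiber that called spawn.
spawn(same_thread: true) { results.send("pinned") }

# Arrival order is not guaranteed, so sort before printing.
received = [results.receive, results.receive].sort
p received
```

Without the preview_mt flag both fibers run on the one and only thread, so the option makes no observable difference there.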

Yeah, I know. However, I figure that spawning new fibers will still be more common than spawning threads, and all existing code expects to be spawning a fiber.

spawn doesn’t create a new thread with this implementation. :smile: In fact, new threads are never created after it begins executing your code. The scheduler spins up a thread pool during bootstrapping and new fibers are assigned to one of those threads. :exploding_head:

It looks like it assigns them in round-robin fashion, so a new fiber may or may not be assigned to the same thread as the fiber that spawned it.
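As a toy model of what round-robin placement means here (this is not Crystal's scheduler code, and RoundRobinPool is a hypothetical class written only for illustration):

```crystal
# Toy model only: hands out worker indexes in rotation, the way a
# round-robin scheduler would pick a thread for each newly spawned fiber.
class RoundRobinPool
  getter assignments = [] of Int32

  def initialize(@workers : Int32)
    @next = 0
  end

  # Returns the index of the worker that receives the next fiber.
  def assign : Int32
    worker = @next
    @next = (@next + 1) % @workers
    @assignments << worker
    worker
  end
end

pool = RoundRobinPool.new(4)
6.times { pool.assign }
p pool.assignments # => [0, 1, 2, 3, 0, 1]
```

With four workers, a fiber running on worker 0 that spawns one child lands that child on whichever worker is next in the rotation, which is only sometimes worker 0.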

I think this is expected. Part of the time goes in switching contexts between threads and fibers. I believe there might be more optimizations to improve this situation but I don’t think we’ll get to 4.6x improvement in throughput.

It would be interesting to do a similar benchmark using Go. I know Go does context switches much faster because it needs to preserve fewer registers than we do. And of course they are Google too, so… :grin:


Oh, true, sorry. I wasn’t actually expecting 1:1 scaling with CPU time vs throughput. But with 4.6x CPU consumption I think 4x throughput is reasonable, leaving ~10% for additional logistical overhead.

Either way, the simplicity and expressiveness of this implementation is unbelievably good. I believe that’s more important as a starting point. I’m reasonably sure there are some places to optimize that will yield pretty nice gains. I’ve got my eye on a couple of places and I’ll be experimenting with it a bit this week. :slight_smile:

Agreed! I think performance comparisons with Go are awesome to see. Sometimes Go wins, sometimes Crystal wins, and I think that in places where Go is faster there are lessons for Crystal.

I decided to try it out because I was curious myself. I am not a Go programmer and, in fact, this is literally the first Go program I’ve ever written, but I was able to make it work. I updated the gist I linked in my original post to include that Go app.

The Go server uses the same amount of CPU. Hope you’re sitting down for this next part:

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   129.40us   91.54us   7.59ms   95.51%
Requests/sec:  69332.77

… and Crystal beats it in throughput by 56%. :exploding_head: Feel free to check my work because, like I said, this is the first Go program I’ve ever written.



I see that in your benchmark you serialize JSON. Go is known to have slow JSON serialization because it uses reflection (and Crystal doesn’t). Could you try running a benchmark where Crystal and Go just send “Hello world” in the response? That way we would be comparing just the HTTP serving part, which is where multithreading (context switches, scheduler, etc.) is mainly exercised.

But of course, real world apps use JSON serialization so even without the simpler benchmark this is great news! Thank you for doing these benchmarks :heart:


Removing the JSON serialization from the Go app only added ~18% throughput.

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   109.05us  119.22us   9.19ms   99.46%
Requests/sec:  81592.82
Golang code here
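package main

import (
  "fmt"
  "log"
  "net/http"
)

func main() {
  // Respond to every request with a hard-coded string, no JSON involved.
  http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hello")
  })

  log.Fatal(http.ListenAndServe(":54321", nil))
}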

It appears Crystal returns a nontrivial JSON payload faster than Go writes a hard-coded string, but let’s check Crystal anyway:

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    54.06us   23.38us 723.00us   96.85%
Requests/sec: 173259.23

I may be going out on a limb here but … ummm … I think we’re good?
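For anyone checking the comparisons, the relative differences between the four multi-threaded runs can be recomputed from the wrk output quoted in this thread. The RUNS table and percent_faster helper are names made up for this sketch.

```crystal
# Requests/sec figures copied from the wrk runs quoted in this thread.
RUNS = {
  "Crystal JSON"  => 108131.26,
  "Go JSON"       => 69332.77,
  "Go Hello"      => 81592.82,
  "Crystal Hello" => 173259.23,
}

# Percentage by which throughput `a` exceeds throughput `b`.
def percent_faster(a : Float64, b : Float64) : Float64
  ((a / b - 1.0) * 100.0).round(1)
end

comparisons = [
  {"Crystal JSON", "Go JSON"},
  {"Go Hello", "Go JSON"},
  {"Crystal Hello", "Go Hello"},
  {"Crystal Hello", "Crystal JSON"},
]

comparisons.each do |(a, b)|
  puts "#{a} vs #{b}: +#{percent_faster(RUNS[a], RUNS[b])}%"
end
```

The first two lines reproduce the roughly 56% and 18% figures mentioned earlier; the other two compare the hard-coded-string runs.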

The only thing I changed in the Crystal code was to make the call method just run context.response << "Hello" to match the Go app (also removed the Fiber.yield for the same reason).
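A handler along those lines might look like the sketch below. The HelloHandler class name and the port (borrowed from the Go app) are assumptions; the actual benchmark code lives in the gist and may be structured differently.

```crystal
require "http/server"

# Hypothetical handler: writes a hard-coded body and nothing else.
class HelloHandler
  include HTTP::Handler

  def call(context : HTTP::Server::Context)
    context.response << "Hello"
  end
end

server = HTTP::Server.new([HelloHandler.new] of HTTP::Handler)

# Uncomment to serve. Port 54321 mirrors the Go app and is an assumption here.
# server.bind_tcp 54321
# server.listen
```

Compiling with -Dpreview_mt --release and setting CRYSTAL_WORKERS would then match the setup described in this thread.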


Since your HTTP handlers don’t do anything, each request is executed blazingly fast. I suppose the network stack could already be congested. You could try with more and fewer threads to see if there is a better utilization ratio.

To do that, run the program with an env var like CRYSTAL_WORKERS=8

Tuning the thread pool was the first thing I tried. :-)

I was using CRYSTAL_WORKERS=6 for these benchmarks, which is what allowed it to use up to 450-475% CPU. With the raw-string responses, I was only using ~380% CPU and still handling 170k reqs/sec — the Go code used 450% to do the same work. kernel_task was hitting 100% CPU on its own handling the I/O, so I don’t think there’s a way I can get > 170k while running wrk (which was at 180% CPU) and the Crystal server on the same macOS box.

I tried on another beefy DigitalOcean droplet because that’s a closer environment to what people will actually be running web servers on. I couldn’t get it to compile Crystal for some reason and I don’t have time to look into it before work, so I’ll have to try again over the weekend.


It would be interesting to compare to Go with fasthttp instead of standard net/http.

Go with fasthttp can be much faster.

The main reason seems to be fewer memory allocations.

But boy I prefer Crystal

Feel free to try it out if you’re interested! You clearly know more about Go than I do. :smile:

My own preference is to look at apps that are performing realistic workloads, which is why I serialized JSON to begin with. I’d like to see it do a more realistic amount of work within the request, tbh (I’m not interested in how fast I can make an app do nothing :joy:), like talking to a DB, cache, etc. I’m just having trouble getting that working right now so that’ll have to wait a bit longer.


like talking to a DB

Just note that crystal-db isn’t prepared for multithreading right now, but will soon be. So if you want to write a benchmark using that you’ll most likely get crashes or similar.


With the web requests one, is it using 100% of the CPUs?


It would be interesting to see the output from a sampling CPU profiler for those 100% CPU runs, i.e. “where is it using all that extra CPU?”, since I guess there are theoretically 8 cores but it’s only 4x as fast. :) Basically just out of curiosity.