Multithreaded Crystal initial thoughts

yes

It would be interesting to see the output from a sampling cpu profiler for those 100% cpu runs (i.e. “where is it using all that extra cpu” since I guess there’s theoretically 8 cores but it’s only 4x as fast… :) hmm…basically just out of curiosity…

Running a profiler on the original app I wrote for this thread (the one that yields the fiber and then serializes JSON) shows the event loop needs 12% of the CPU time available to that thread (1.55 seconds out of 12.82):

Most of that is libevent (1.46 / 1.55s, or 94.2% of CPU time), so we can’t optimize that any further within Crystal:

Scheduling is ~64ms out of that 1.55s (4%) so any optimization there will yield negligible results.

In a larger app, the event loop is significantly less of a concern. Here is the profile of a Reddit-like app I wrote a while back to test out the neo4j shard:

This hits a Neo4j database and outputs HTML, so it’s a reasonably realistic workload. The event loop used 399ms out of 5.98 seconds (6%). Fiber scheduling was 28ms — 0.5% of that fiber’s 6 CPU seconds. The rest of the event loop was all libevent.

Note: all of these traces contain 6 threads, but I only showed the heaviest one because they’re basically all the same

6 Likes

Now that Crystal 0.31.0 has been released! it’s probably a good idea to re-run benchmarks.

I’ve tried a hello world http server on MacOS with Crystal 0.31.0 and max I can get is about 48K req/s in single thread mode and 84K req/s in multithread mode with CRYSTAL_WORKERS=3 (other number of workers perform worse for me).

This is about 1.8 more throughput which is less that mentioned in first article https://crystal-lang.org/2019/09/06/parallelism-in-crystal.html where it goes from like 48K to about 120K for 3 workers as shown on charts. That’s about 2.5 more throughput.

UPDATE: Chart at https://crystal-lang.org/2019/09/06/parallelism-in-crystal.html it’s ~ 120K for CRYSTAL_WORKERS=4, but still, even CRYSTAL_WORKERS=2 seem to give better results than CRYSTAL_WORKERS=3 in my case.
I assume this has to do with either some commits landed after original article or my setup.

My env:

$ crystal --version
Crystal 0.31.0 (2019-09-24)

LLVM: 8.0.1
Default target: x86_64-apple-macosx

I’ve used http server from Crystal’s home page:

require "http/server"

server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello world, got #{context.request.path}!"
end

puts "Listening on http://127.0.0.1:8080"
server.listen(8080)

With

$ wrk -d10s -t2 -c128 --latency http://localhost:8080/foo
2 Likes

I did re-run the benchmarks after https://github.com/crystal-lang/crystal/commit/a0132d5bb328d232b3e9decf0c4059b26280b7e6 and I didn’t notice any notorious difference.

I guess the difference is in the setup.

@bcardiff Did you run the benchmark after turning overflow checks on? That might affect performance.

Can this one degrade HTTP server performance? https://github.com/crystal-lang/crystal/pull/8168

I pulled up the code from my original post and ran it on the same machine on v0.31.0 released through Homebrew and got 108518 reqs/sec (90.68µs avg latency) at 450-470% CPU usage. This lines up with my original findings of 108131 reqs/sec (93.54µs avg latency) at 458% CPU usage.

@vlazar You may want to try wrk with -c10. Having 128 simultaneous connections to a web process that serves its response in microseconds is equivalent to serving millions of requests per second from a single process and nobody’s actually doing that. :-) With something that responds that quickly, the odds are you’ll never have more than a few simultaneous connections, so the default -c10 is a more realistic workload.

Alternatively, you can try giving it a realistic workload to make the 128 simultaneous connections more realistic but, as @asterite mentioned earlier in the thread, existing DB connection pools may not be threadsafe.

1 Like

as @asterite mentioned earlier in the thread, existing DB connection pools may not be threadsafe

Now that 0.31.0 is out I think crystal-db is thread-safe. That’s also been the work of @bcardiff in the past release, as far as I know.

1 Like

Ah, nice! I saw the 0.7.0 release but I didn’t notice anything related to thread safety in it. It’s also possible I just didn’t know what to look for in it since I’m only using Crystal with Neo4j atm.

The main things solved in crystal-db 0.7.0 was the locking of checkout/release of connection that could generate some peaks above the expected limit.

There could be different configuration that might be better and haven’t been stressed out: is it better to have a single pool for the whole process or a pool per worker? If the later, the workers will not need to synchronise to checkout/release connections. But might not be applicable in the light of fiber stealing which today is not present but it haven’t been settle.

1 Like

A recent benchmark for http server looks even slightly better.

5 Likes

Interesting. When I profiled it (single threaded) it seemed the majority of the time was spent in #to_json, with a bunch of that being the time to convert floats to strings. And lots in reading /dev/urandom for UUID.rand .FWIW… :)

OK here’s some patches I found to speed it up a little but overall…couldn’t see much, except that past like 4-6 threads it seems to get all choked up on global locks allocs and realloc’s…dang…

(you can see by running gdb against it:

gdb -ex “set pagination 0” -ex “thread apply all bt” -batch -p

or using some other profiler.)

At the same time it did get to like 10% idle cpu on a box with 8 cores I was using for it. Maybe that’s good?

Guess there’s multi process if you need to use 100% of the cores :slight_smile:

Or maybe somebody could dive in and make the bdwgc much faster somehow (more aggressive thread local sructures?)…or a different GC altogether… :)

1 Like

You’re right. By default, macOS Instruments only aggregates calls to the same function when they’re in the same part of the call stack, and I forgot to toggle the setting on to aggregate them all. When I did that, it also showed JSON and UUID generation to be the hottest code. Nice catch!

Fun fact about UUID generation being a bottleneck, though, is that the Go app and Crystal apps aren’t generating the same number of UUIDs. The Go version generates a single UUID — it involves error handling (because of I/O) and building the string manually and I was being lazy. So I tested making the Crystal app only generate a single UUID to work more like the Go app and it gained 29% throughput (140k reqs/sec up from 108k). :exploding_head:

6 Likes

It pass almost 900 days since the last reply.

It’s time to do some compare again?

In fact, i have more interest compare idea:

  1. Alpine static linked version VS Glibc dynamic link version, which is better?
  2. For multiple thread, if we run on two thread, VS we run two process, and each proces only have one thread with external tools(e.g. foreman) and backed on a nginx as reverse proxy, which is better?

I would consider it a bug if one of them was measurably better than the other.

This is completely un-answerable without context and even with all information, I would say it comes down to what you consider better. Is raw throughput better? stability? observability and performance analysis? Ease of maintenance?

MT is not even out of preview phase, there are no Frameworks using it as far as I know, so i think it is way too early to ask those questions.

2 Likes

There are very certainly measureable differences between these different libc implementations. That’s just a matter of different strategies and trade-offs chosen by the lib authors. Not a bug.

1 Like

Interesting!
I did write “measurable” but, is this big enough to matter?
i.e. glibc vs. the one used currently to build static binaries (forgot how its called)

I’m pretty sure that performance differences can matter depending on the type of work you need to do.

http://www.etalabs.net/compare_libcs.html

1 Like