Running a profiler on the original app I wrote for this thread (the one that yields the fiber and then serializes JSON) shows the event loop needs 12% of the CPU time available to that thread (1.55 seconds out of 12.82):
Scheduling is ~64ms of that 1.55s (4%), so any optimization there will yield negligible results.
In a larger app, the event loop is significantly less of a concern. Here is the profile of a Reddit-like app I wrote a while back to test out the neo4j shard:
This hits a Neo4j database and outputs HTML, so it’s a reasonably realistic workload. The event loop used 399ms out of 5.98 seconds (6%). Fiber scheduling was 28ms — 0.5% of that fiber’s 6 CPU seconds. The rest of the event loop was all libevent.
Note: all of these traces contain 6 threads, but I only showed the heaviest one because they're basically all the same.
I’ve tried a hello world http server on macOS with Crystal 0.31.0 and the max I can get is about 48K req/s in single-thread mode and 84K req/s in multithread mode with CRYSTAL_WORKERS=3 (other worker counts perform worse for me).
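For context, the server I benchmarked is essentially the stock hello-world HTTP example from the Crystal docs (reconstructed from memory; the port and response text are assumptions, not my exact code):

```crystal
require "http/server"

# Minimal hello-world server. Compile with -Dpreview_mt --release and run
# with CRYSTAL_WORKERS=3 to reproduce the multithreaded numbers above.
server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello world!"
end

address = server.bind_tcp "0.0.0.0", 8080
puts "Listening on http://#{address}"
server.listen
```

I then pointed wrk at it and read the req/s figures from its summary output.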
This is about 1.8× the throughput, which is less than in the first article (https://crystal-lang.org/2019/09/06/parallelism-in-crystal.html), where the charts show it going from about 48K to about 120K for 3 workers. That’s about 2.5× the throughput.
UPDATE: In the chart at https://crystal-lang.org/2019/09/06/parallelism-in-crystal.html it’s ~120K for CRYSTAL_WORKERS=4, but still, even CRYSTAL_WORKERS=2 seems to give better results than CRYSTAL_WORKERS=3 in my case.
I assume this has to do either with commits that landed after the original article or with my setup.
I pulled up the code from my original post and ran it on the same machine on v0.31.0 released through Homebrew and got 108518 reqs/sec (90.68µs avg latency) at 450-470% CPU usage. This lines up with my original findings of 108131 reqs/sec (93.54µs avg latency) at 458% CPU usage.
@vlazar You may want to try wrk with -c10. Having 128 simultaneous connections to a web process that serves its response in microseconds is equivalent to serving millions of requests per second from a single process and nobody’s actually doing that. :-) With something that responds that quickly, the odds are you’ll never have more than a few simultaneous connections, so the default -c10 is a more realistic workload.
Alternatively, you can try giving it a realistic workload to make the 128 simultaneous connections more realistic but, as @asterite mentioned earlier in the thread, existing DB connection pools may not be threadsafe.
Ah, nice! I saw the 0.7.0 release but I didn’t notice anything related to thread safety in it. It’s also possible I just didn’t know what to look for in it since I’m only using Crystal with Neo4j atm.
The main thing solved in crystal-db 0.7.0 was the locking of checkout/release of connections, which could generate peaks above the expected limit.
There could be other configurations that might be better and haven’t been stress-tested: is it better to have a single pool for the whole process or a pool per worker? With the latter, the workers would not need to synchronize to check out/release connections. But that might not be applicable in light of fiber stealing, which is not present today but hasn’t been settled either.
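The two configurations above could be sketched like this. This is purely illustrative: crystal-db today only gives you the single shared pool via `DB.open`, the connection URI and pool sizes are made up, and the per-worker variant is only safe as long as fibers don’t migrate between worker threads (i.e. no fiber stealing):

```crystal
require "db"
# require "pg"  # or whichever driver shard you use

# What crystal-db gives you today: one pool for the whole process.
# Checkout/release is synchronized across all workers.
shared_db = DB.open "postgres://localhost/app?max_pool_size=25"

# Hypothetical pool-per-worker alternative: each worker thread gets its own
# smaller pool, so checkout/release never crosses threads. The hash itself
# still needs a lock, but only for the first lookup per thread.
pools = Hash(Thread, DB::Database).new
mutex = Mutex.new

def pool_for_current_worker(pools, mutex)
  mutex.synchronize do
    pools[Thread.current] ||= DB.open "postgres://localhost/app?max_pool_size=8"
  end
end
```

The trade-off is the usual one: per-worker pools avoid cross-thread synchronization on the hot path but can leave connections idle in one worker while another is starved.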
Interesting. When I profiled it (single-threaded), it seemed the majority of the time was spent in #to_json, with a lot of that being the time to convert floats to strings, and a lot in reading /dev/urandom for UUID.random. FWIW… :)
OK, here are some patches I found to speed it up a little, but overall… I couldn’t see much, except that past like 4-6 threads it seems to get all choked up on global locks, allocs, and reallocs… dang…
You’re right. By default, macOS Instruments only aggregates calls to the same function when they’re in the same part of the call stack, and I forgot to toggle the setting on to aggregate them all. When I did that, it also showed JSON and UUID generation to be the hottest code. Nice catch!
A fun fact about UUID generation being a bottleneck, though: the Go and Crystal apps aren’t generating the same number of UUIDs. The Go version generates a single UUID — it involves error handling (because of I/O) and building the string manually, and I was being lazy. So I tested making the Crystal app generate only a single UUID, to work more like the Go app, and it gained 29% throughput (140k reqs/sec, up from 108k).
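Roughly, the difference looks like this (a sketch, not the actual benchmark code; the payload shape and item count are invented for illustration):

```crystal
require "uuid"
require "json"

# What the Crystal app was doing: a fresh UUID per item in the response,
# so every item pays for randomness (/dev/urandom) plus string formatting.
def many_uuids(count = 100)
  Array.new(count) { {"id" => UUID.random.to_s} }.to_json
end

# Closer to the Go app: a single UUID per response, reused for every item.
# Switching the Crystal app to this gained ~29% throughput in my test.
def one_uuid(count = 100)
  id = UUID.random.to_s
  Array.new(count) { {"id" => id} }.to_json
end
```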
Which is better: a statically linked Alpine (musl) version or a dynamically linked glibc version?
And for multithreading: which is better, running one process with two threads, or running two single-threaded processes managed by an external tool (e.g. foreman) behind nginx as a reverse proxy?
I would consider it a bug if one of them was measurably better than the other.
This is completely un-answerable without context, and even with all the information, I would say it comes down to what you consider better. Is raw throughput better? Stability? Observability and performance analysis? Ease of maintenance?
MT is not even out of the preview phase, and there are no frameworks using it as far as I know, so I think it is way too early to ask those questions.
There are very certainly measurable differences between these libc implementations. That’s just a matter of the different strategies and trade-offs chosen by the library authors. Not a bug.
Interesting!
I did write “measurable”, but is this big enough to matter?
i.e. glibc vs. the one currently used to build static binaries (I forgot what it’s called)
So I am sure folks have taken a look at this and probably found issues, but for what it’s worth… here is a standard microbenchmark site that might provide some value (apologies if this was linked or mentioned elsewhere in this forum):