Running a profiler on the original app I wrote for this thread (the one that yields the fiber and then serializes JSON) shows the event loop needs 12% of the CPU time available to that thread (1.55 seconds out of 12.82):
Scheduling is ~64ms of that 1.55s (4%), so any optimization there will yield negligible results.
In a larger app, the event loop is significantly less of a concern. Here is the profile of a Reddit-like app I wrote a while back to test out the neo4j shard:
This hits a Neo4j database and outputs HTML, so it’s a reasonably realistic workload. The event loop used 399ms out of 5.98 seconds (6%). Fiber scheduling was 28ms — 0.5% of that fiber’s 6 CPU seconds. The rest of the event loop was all libevent.
Note: all of these traces contain 6 threads, but I only showed the heaviest one because they're basically all the same.
I’ve tried a hello world http server on macOS with Crystal 0.31.0 and the max I can get is about 48K req/s in single-thread mode and 84K req/s in multithread mode with CRYSTAL_WORKERS=3 (other worker counts perform worse for me).
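For context, the server I benchmarked is essentially the stock hello-world HTTP example from the Crystal docs (reconstructed from memory; the port and response text are assumptions, not my exact code):

```crystal
require "http/server"

# Minimal hello-world server. Compile with -Dpreview_mt --release and run
# with CRYSTAL_WORKERS=3 to reproduce the multithreaded numbers above.
server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello world!"
end

address = server.bind_tcp "0.0.0.0", 8080
puts "Listening on http://#{address}"
server.listen
```

I then pointed wrk at it and read the req/s figures from its summary output.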
This is about 1.8× the throughput, which is less than in the first article (https://crystal-lang.org/2019/09/06/parallelism-in-crystal.html), where the charts show it going from about 48K to about 120K for 3 workers. That’s about 2.5× the throughput.
UPDATE: In the chart at https://crystal-lang.org/2019/09/06/parallelism-in-crystal.html it’s ~120K for CRYSTAL_WORKERS=4, but still, even CRYSTAL_WORKERS=2 seems to give better results than CRYSTAL_WORKERS=3 in my case.
I assume this has to do either with commits that landed after the original article or with my setup.
I pulled up the code from my original post and ran it on the same machine on v0.31.0 released through Homebrew and got 108518 reqs/sec (90.68µs avg latency) at 450-470% CPU usage. This lines up with my original findings of 108131 reqs/sec (93.54µs avg latency) at 458% CPU usage.
@vlazar You may want to try wrk with -c10. Having 128 simultaneous connections to a web process that serves its response in microseconds is equivalent to serving millions of requests per second from a single process and nobody’s actually doing that. :-) With something that responds that quickly, the odds are you’ll never have more than a few simultaneous connections, so the default -c10 is a more realistic workload.
Alternatively, you can try giving it a realistic workload to make the 128 simultaneous connections more realistic but, as @asterite mentioned earlier in the thread, existing DB connection pools may not be threadsafe.
Ah, nice! I saw the 0.7.0 release but I didn’t notice anything related to thread safety in it. It’s also possible I just didn’t know what to look for in it since I’m only using Crystal with Neo4j atm.
The main thing solved in crystal-db 0.7.0 was the locking of checkout/release of connections, which could generate peaks above the expected limit.
There could be other configurations that might be better and haven’t been stress-tested: is it better to have a single pool for the whole process or a pool per worker? With the latter, the workers would not need to synchronize to check out/release connections. But that might not be applicable in light of fiber stealing, which is not present today but hasn’t been settled either.
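The two configurations above could be sketched like this. This is purely illustrative: crystal-db today only gives you the single shared pool via `DB.open`, the connection URI and pool sizes are made up, and the per-worker variant is only safe as long as fibers don’t migrate between worker threads (i.e. no fiber stealing):

```crystal
require "db"
# require "pg"  # or whichever driver shard you use

# What crystal-db gives you today: one pool for the whole process.
# Checkout/release is synchronized across all workers.
shared_db = DB.open "postgres://localhost/app?max_pool_size=25"

# Hypothetical pool-per-worker alternative: each worker thread gets its own
# smaller pool, so checkout/release never crosses threads. The hash itself
# still needs a lock, but only for the first lookup per thread.
pools = Hash(Thread, DB::Database).new
mutex = Mutex.new

def pool_for_current_worker(pools, mutex)
  mutex.synchronize do
    pools[Thread.current] ||= DB.open "postgres://localhost/app?max_pool_size=8"
  end
end
```

The trade-off is the usual one: per-worker pools avoid cross-thread synchronization on the hot path but can leave connections idle in one worker while another is starved.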
Interesting. When I profiled it (single-threaded), it seemed the majority of the time was spent in #to_json, with a lot of that being the time to convert floats to strings, and a lot in reading /dev/urandom for UUID.random. FWIW… :)
OK, here are some patches I found to speed it up a little, but overall… I couldn’t see much, except that past like 4-6 threads it seems to get all choked up on global locks, allocs, and reallocs… dang…
You’re right. By default, macOS Instruments only aggregates calls to the same function when they’re in the same part of the call stack, and I forgot to toggle the setting on to aggregate them all. When I did that, it also showed JSON and UUID generation to be the hottest code. Nice catch!
A fun fact about UUID generation being a bottleneck, though: the Go and Crystal apps aren’t generating the same number of UUIDs. The Go version generates a single UUID — it involves error handling (because of I/O) and building the string manually, and I was being lazy. So I tested making the Crystal app generate only a single UUID, to work more like the Go app, and it gained 29% throughput (140k reqs/sec, up from 108k).
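Roughly, the difference looks like this (a sketch, not the actual benchmark code; the payload shape and item count are invented for illustration):

```crystal
require "uuid"
require "json"

# What the Crystal app was doing: a fresh UUID per item in the response,
# so every item pays for randomness (/dev/urandom) plus string formatting.
def many_uuids(count = 100)
  Array.new(count) { {"id" => UUID.random.to_s} }.to_json
end

# Closer to the Go app: a single UUID per response, reused for every item.
# Switching the Crystal app to this gained ~29% throughput in my test.
def one_uuid(count = 100)
  id = UUID.random.to_s
  Array.new(count) { {"id" => id} }.to_json
end
```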
Which is better: a statically linked Alpine (musl) version or a dynamically linked glibc version?
And for multithreading: which is better, running one process with two threads, or running two single-threaded processes managed by an external tool (e.g. foreman) behind nginx as a reverse proxy?
I would consider it a bug if one of them was measurably better than the other.
This is completely un-answerable without context, and even with all the information, I would say it comes down to what you consider better. Is raw throughput better? Stability? Observability and performance analysis? Ease of maintenance?
MT is not even out of the preview phase, and there are no frameworks using it as far as I know, so I think it is way too early to ask those questions.
There are very certainly measurable differences between these libc implementations. That’s just a matter of the different strategies and trade-offs chosen by the library authors. Not a bug.
Interesting!
I did write “measurable”, but is this big enough to matter?
i.e. glibc vs. the one currently used to build static binaries (I forgot what it’s called)
So I am sure folks have taken a look at this and probably found issues, but for what it’s worth… here is a standard microbenchmark site that might provide some value (apologies if this was linked or mentioned elsewhere in this forum):