New Event Loop (UNIX): call for reviews & tests

Following the changes from RFC 7 (see RFC 0007: Event loop refactor), I started experimenting with epoll and kqueue, which led me to open multiple pull requests over the last two months, each quickly superseded by the next.

The long development has come to an end: the final pull request is ready!

If there are any brave souls out there, please try to review it, but be aware that the changes are significant :sweat_smile: Hopefully there are enough comments and documentation, and the order of commits should help you grasp the introduced concepts.

It most likely won't make it into 1.14 since we're so close to the release. Even behind a preview flag, it doesn't make much sense to ship something that might still have issues. Let's polish it over the development of 1.15 instead.

Alternatively, there's nothing preventing you from testing: check out the branch, recompile your application, and hopefully :tada: … otherwise please report any regressions or issues. In the very best scenario (e.g. purely IO-bound with long-lived connections) you might see up to a 20% improvement; at worst it shouldn't be slower than libevent.

9 Likes

Notes:

  1. The changes are significant, yet there are many things I left for future improvements: a skiplist for timers, integrating signalfd / EVFILT_SIGNAL as well as pidfd / EVFILT_PROC, and more. This is only the "core" of further improvements to the event loop :sweat_smile:

  2. I've read up on io_uring and have some preliminary design ideas about how to integrate it into Crystal, but that will have to wait. I'm working on RFC 2 now, trying to take advantage of the new evloop :grin:

1 Like

Thanks a lot @ysbaddaden :heart: Just compiled a 7K LOC Kemal application with this branch. Ran some benchmarks with wrk and I didn't see any errors; performance is on par with Crystal 1.13.1, maybe 1-2% faster, which is within the margin of error.

3 Likes

The magic door has finally opened! Cheers!

From ibidem:

"Most applications shouldn't notice any impact from this design choice, since an fd is usually not shared across fibers (concurrency issues), except maybe a server socket with multiple accepting fibers? In that case you'll need to make sure the fibers are on the same thread (preview_mt) or in the same context (RFC 2)."
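
For illustration, here's a minimal sketch of that "shared fd" case (my own example, not code from the PR): one listening socket accepted from by several fibers. Under the new event loop those fibers would need to run on the same thread (-Dpreview_mt) or in the same context (RFC 2).

```crystal
require "socket"

# One server socket shared by several accepting fibers: the rare case where a
# single fd is used from multiple fibers and the quoted caveat applies.
server = TCPServer.new("0.0.0.0", 8080)

4.times do
  spawn do
    while client = server.accept?
      client.puts "hello"
      client.close
    end
  end
end

sleep # keep the main fiber alive
```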

Does this mean that under certain conditions -Dpreview_mt is or becomes obsolete?

Yes. A new multithreading environment is coming up, as discussed in RFC 2. The event loop refactor is a prerequisite for that. It'll replace the old -Dpreview_mt.

2 Likes

I tested this with the current Benben trunk code, about 71K LOC total, and things run/compile just fine. Performance seems to take a decent hit, though I'm also comparing Crystal 1.12.2 to this PR. I compiled Crystal myself both times.

Rendering 79 VGM files to WAV, four in parallel on four cores:

  • Crystal v1.12.2: rendering took 03:28.087905842 with an avg samples/sec of 906,261
  • This PR: rendering took 03:49.727951144 with an avg samples/sec of 821,117

The machine:

  • Ryzen 3 3200U, 10GB RAM
  • Slackware 15.0
  • Kernel 6.10.11
  • LLVM 13
  • Samsung 970 EVO Plus SSD

4 Likes

Thanks. Yet another proof that reading helps.

@MistressRemilia Damn. I assume Benben is working with disk files, right? Meaning it should never block (disk files are always ready) and thus never enter epoll. Maybe this is an impact of the EventLoop#read and #write methods being slightly different? :thinking:

Would you have some reproducible steps? I don't know where to find VGM files, for instance.

1 Like

@MistressRemilia I checked out your code (nice, I learned a bit of Fossil), and you only use disk files, no sockets, and I can't seem to pinpoint anything fundamentally different between the two evloops that would explain why it would be slower to read, write, or close disk files (it's always blocking).

Then the next explanation would be the timers: sleeps (obviously) and fiber yields (each creates a 0-second timer). You don't use IO timeouts or select with timeouts, but you do use sleep and Fiber.yield.

They're the weak point of this evloop refactor: Crystal::Evented::Timers is just a placeholder for a better data structure; it's thread-unsafe and protected by a spin lock (not even a mutex), which creates a contention point for threads, etc.
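
To make the timer pressure concrete, here's a tiny self-contained sketch (mine, purely illustrative): every sleep and every Fiber.yield goes through the event loop's timer list, so a tight loop that yields often hammers Crystal::Evented::Timers.

```crystal
done = Channel(Nil).new

spawn do
  100_000.times do
    Fiber.yield # enqueues a 0-second timer, then reschedules this fiber
  end
  done.send nil
end

done.receive
```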

Is this to be understood as in "it's known, and will change" or as in "it's a limitation of the epoll event loop refactor and we can't do much about it"?

1 Like

I'm certain ~we~ he'll fix it in a future PR.

3 Likes

Yeah, it's only writing stuff to disk, no sockets or anything. Specifically, using --render will cause it to write rendered audio data to WAV (or Au) files. If you want some VGM files to test, check VGMRips (maybe the X68000 version of Castlevania, since I use it as part of my benchmarking). Or just test it with FLAC/MP3/Ogg/WavPack/etc.

FWIW, I also checked it against 1.13.3 just now and the regression doesn't appear in that version. So I think you may be on to something with the Timers.

2 Likes

What changes or decisions, if any, are needed for io_uring support? Windows also has the IORing API, which is similar.

The architecture is already designed with io_uring support in mind. It's not implemented yet, but we know what it needs.
In fact, the IOCP-based implementation on Windows already behaves very similarly to io_uring, i.e. it schedules tasks and waits for their completion instead of polling for readiness.

@dsisnero IOCP and io_uring are the very reason we have RFC 7: to abstract how the event loop works internally.

1 Like

@MistressRemilia I can't reproduce the performance hit on my Intel 14700K (Ubuntu 22.04, Linux 6.9, LLVM 18).

I downloaded the VGMs from the "top 5 best rated" on VGMRips (Dragon Saber, OutRun, Battle Garegga, Daytona USA + Akumajo Dracula). I removed the xx Marginal Consciousness (in game loop) file because it took forever to transcode.

I tried with 4 threads & jobs, then 20 (the number of CPU cores) and 28 (the number of CPU threads) threads & jobs, but I always get roughly identical timings (time spent & average samples/sec), whether with the Crystal 1.14.0 release or the evloop branch :confused:

I also tried replacing the deque in the timers with a minheap, then a skiplist, but to no effect: I still reach the same overall performance in benben.

1 Like

What heap implementation? Regular binary heaps are typically not very fast… Not that it probably matters in any way whatsoever here, since there was no difference measured.

Trying to figure out things like this is where it would be nice to have more powerful instrumentation, so that the time cost could be attributed to different parts of the runtime or program.

Hrm… let me go back and thoroughly document my steps today. There must be some odd difference in how I'm doing it.

@yxhuvud They're actually super fast… well, mostly. Both Go and libevent use minheaps.

I'm using a 4-heap (see d-ary heap). It has flawless performance on insert and great performance on delete-min, but arbitrary delete becomes painfully slow when the heap grows (over a thousand timers) because lookups are slow (linear scans).

The main advantage is that a heap is basically an array, and a 4-heap considers 4 children (instead of 2), so we can take full advantage of CPU caches (not much thrashing).
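
For readers unfamiliar with d-ary heaps, here's a rough, simplified 4-ary min-heap sketch over a flat array (illustrative only, not the actual Crystal::Evented::Timers code). The children of node i sit contiguously at 4i+1..4i+4, which is what keeps sifting cache friendly, while cancelling an arbitrary timer still needs a linear scan to find it first.

```crystal
class MinHeap4(T)
  # Flat array storage: parent of i is (i - 1) // 4, children are 4i+1 .. 4i+4.
  def initialize
    @data = [] of T
  end

  # Insert: append, then sift up towards the root.
  def push(value : T) : Nil
    @data << value
    i = @data.size - 1
    while i > 0 && @data[i] < @data[(i - 1) // 4]
      @data.swap(i, (i - 1) // 4)
      i = (i - 1) // 4
    end
  end

  # Delete-min: replace the root with the last element, then sift it down,
  # comparing up to 4 children per level.
  def pop : T?
    return nil if @data.empty?
    min = @data[0]
    last = @data.pop
    unless @data.empty?
      @data[0] = last
      sift_down(0)
    end
    min
  end

  private def sift_down(i : Int32) : Nil
    loop do
      first_child = 4 * i + 1
      break if first_child >= @data.size
      smallest = i
      last_child = Math.min(first_child + 3, @data.size - 1)
      (first_child..last_child).each do |c|
        smallest = c if @data[c] < @data[smallest]
      end
      break if smallest == i
      @data.swap(i, smallest)
      i = smallest
    end
  end
end

heap = MinHeap4(Int32).new
[5, 1, 9, 3].each { |n| heap.push(n) }
p heap.pop # => 1
```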

The skiplist is nice: it has excellent performance on delete-min, and arbitrary delete ain't fast but never becomes painfully slow (thanks to the fast lookups), yet inserts are significantly slower than the above minheap because it must allocate, and following nodes is subject to CPU cache thrashing…

I'll try to keep a list of free nodes. That may help with the allocations (but won't help with the cache thrashing) :cry: