New Event Loop (UNIX): call for reviews & tests

Following the changes from RFC 7 (see RFC 0007: Event loop refactor), I started experimenting with epoll and kqueue, which led me to open multiple pull requests over the last two months, each quickly superseded by the next.

The long development has come to an end: the final pull request is ready!

If there are any brave souls out there, please try to review it, but be aware that the changes are significant :sweat_smile: Hopefully there are enough comments and documentation, and the order of commits should help you grasp the introduced concepts.

It most likely won't make it into 1.14 since we're so close to the release. Even behind a preview flag, it doesn't make much sense to ship something that might still have issues. Let's polish it over the development of 1.15 instead.

Alternatively, there's nothing preventing you from testing: check out the branch, recompile your application, and hopefully :tada: … otherwise please report any regressions or issues. In the very best scenario (e.g. purely IO-bound with long-lived connections) you might see up to a 20% improvement; at worst it shouldn't be slower than libevent.

9 Likes

Notes:

  1. The changes are significant, yet there are many things I left for future improvements: a skiplist for timers, integrating signalfd / EVFILT_SIGNAL as well as pidfd / EVFILT_PROC, and more. This is only the "core" of further improvements to the event loop :sweat_smile:

  2. I've read up on io_uring and have some preliminary design ideas about how to integrate it into Crystal, but that will have to wait. I'm working on RFC 2 now, trying to take advantage of the new evloop :grin:

1 Like

Thanks a lot @ysbaddaden :heart: Just compiled a 7K LOC Kemal application with this branch. Ran some benchmarks with wrk and I didn't see any errors; performance is on par with Crystal 1.13.1, maybe 1-2% faster, which is within the margin of error.

3 Likes

The magic door has finally opened! Cheers!

From ibidem:

"Most applications shouldn't notice any impact from this design choice, since an fd is usually not shared across fibers (concurrency issues), except maybe a server socket with multiple accepting fibers? In that case you'll need to make sure the fibers are on the same thread (preview_mt) or in the same context (RFC 2)."
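
For illustration, here's a minimal sketch of that "shared fd" case (my own example, not code from the PR): one listening socket accepted from by several fibers. Under the new event loop those fibers would need to run on the same thread (-Dpreview_mt) or in the same context (RFC 2).

```crystal
require "socket"

# One server socket shared by several accepting fibers: the rare case where a
# single fd is used from multiple fibers and the quoted caveat applies.
server = TCPServer.new("0.0.0.0", 8080)

4.times do
  spawn do
    while client = server.accept?
      client.puts "hello"
      client.close
    end
  end
end

sleep # keep the main fiber alive
```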

Does this mean that under certain conditions -Dpreview_mt is or becomes obsolete?

Yes. A new multithreading environment is coming up, as discussed in RFC 2. The event loop refactor is a prerequisite for that. It'll replace the old -Dpreview_mt.

2 Likes

I tested this with the current Benben trunk code, about 71K LOC total, and things run/compile just fine. Performance seems to take a decent hit, though I'm also comparing Crystal 1.12.2 to this PR. I compiled Crystal myself both times.

Rendering 79 VGM files to WAV, four in parallel on four cores:

  • Crystal v1.12.2: rendering took 03:28.087905842 with an avg samples/sec of 906,261
  • This PR: rendering took 03:49.727951144 with an avg samples/sec of 821,117

The machine:

  • Ryzen 3 3200U, 10GB RAM
  • Slackware 15.0
  • Kernel 6.10.11
  • LLVM 13
  • Samsung 970 EVO Plus SSD

4 Likes

Thanks. Yet another proof that reading helps.

@MistressRemilia Damn. I assume Benben is working with disk files, right? Meaning it should never block (disk files are always ready) and thus never enter epoll. Maybe this is an impact of the EventLoop#read and #write methods being slightly different? :thinking:

Would you have some reproducible steps? I don't know where to find VGM files, for instance.

1 Like

@MistressRemilia I checked out your code (nice, I learned a bit of Fossil), and you only use disk files, no sockets, and I can't seem to pinpoint anything fundamentally different between the two evloops that would explain why it would be slower to read, write, or close disk files (it's always blocking).

Then the next explanation would be the timers: sleeps (obviously) and fiber yields (each creates a 0-second timer). You don't use IO timeouts or select with timeouts, but you do use sleep and Fiber.yield.

They're the weak point of this evloop refactor: Crystal::Evented::Timers is just a placeholder for a better data structure; it's thread-unsafe and protected by a spin lock (not even a mutex), which creates a contention point for threads, etc.
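
To make the timer pressure concrete, here's a tiny self-contained sketch (mine, purely illustrative): every sleep and every Fiber.yield goes through the event loop's timer list, so a tight loop that yields often hammers Crystal::Evented::Timers.

```crystal
done = Channel(Nil).new

spawn do
  100_000.times do
    Fiber.yield # enqueues a 0-second timer, then reschedules this fiber
  end
  done.send nil
end

done.receive
```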

Is this to be understood as in "it's known, and will change" or as in "it's a limitation of the epoll event loop refactor and we can't do much about it"?

1 Like

I'm certain ~we~ he'll fix it in a future PR.

3 Likes

Yeah, it's only writing stuff to disk, no sockets or anything. Specifically, using --render will cause it to write rendered audio data to WAV (or Au) files. If you want some VGM files to test, check VGMRips (maybe the X68000 version of Castlevania, since I use it as part of my benchmarking). Or just test it with FLAC/MP3/Ogg/WavPack/etc.

FWIW, I also checked it against 1.13.3 just now and the regression doesn't appear in that version. So I think you may be on to something with the Timers.

2 Likes

What changes or decisions, if any, are needed for io_uring support? Windows also has the IORing API, which is similar.

The architecture is already designed with io_uring support in mind. It's not implemented yet, but we know what it needs.
In fact, the IOCP-based implementation on Windows already behaves very similarly to io_uring, i.e. it schedules tasks and waits for their completion instead of polling for readiness.

@dsisnero IOCP and io_uring are the very reason we have RFC 7: to abstract how the event loop works internally.

1 Like

@MistressRemilia I can't reproduce the performance hit on my Intel 14700K (Ubuntu 22.04, Linux 6.9, LLVM 18).

I downloaded the VGMs from the "top 5 best rated" on VGMRips (Dragon Saber, OutRun, Battle Garegga, Daytona USA + Akumajo Dracula). I removed the xx Marginal Consciousness (in game loop) file because it took forever to transcode.

I tried with 4 threads & jobs, then 20 (the number of CPU cores) and 28 (the number of CPU threads) threads & jobs, but I always get roughly identical timings (time spent & average samples/sec), whether with the Crystal 1.14.0 release or the evloop branch :confused:

I also tried replacing the deque in the timers with a minheap, then a skiplist, but to no effect: I still reach the same overall performance in benben.

1 Like

What heap implementation? Regular binary heaps are typically not very fast… Not that it probably matters in any way whatsoever here, since there was no difference measured.

Trying to figure out things like this is where it would be nice to have more powerful instrumentation, so that the time cost could be attributed to different parts of the runtime or program.

Hrm… let me go back and thoroughly document my steps today. There must be some odd difference in how I'm doing it.

@yxhuvud They're actually super fast… well, mostly. Both Go and libevent use minheaps.

I'm using a 4-heap (see d-ary heap). It has flawless performance on insert and great performance on delete-min, but arbitrary delete becomes painfully slow when the heap grows (over a thousand timers) because lookups are slow (linear scans).

The main advantage is that a heap is basically an array, and a 4-heap considers 4 children (instead of 2), so we can take full advantage of CPU caches (not much thrashing).
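
For readers unfamiliar with d-ary heaps, here's a rough, simplified 4-ary min-heap sketch over a flat array (illustrative only, not the actual Crystal::Evented::Timers code). The children of node i sit contiguously at 4i+1..4i+4, which is what keeps sifting cache friendly, while cancelling an arbitrary timer still needs a linear scan to find it first.

```crystal
class MinHeap4(T)
  # Flat array storage: parent of i is (i - 1) // 4, children are 4i+1 .. 4i+4.
  def initialize
    @data = [] of T
  end

  # Insert: append, then sift up towards the root.
  def push(value : T) : Nil
    @data << value
    i = @data.size - 1
    while i > 0 && @data[i] < @data[(i - 1) // 4]
      @data.swap(i, (i - 1) // 4)
      i = (i - 1) // 4
    end
  end

  # Delete-min: replace the root with the last element, then sift it down,
  # comparing up to 4 children per level.
  def pop : T?
    return nil if @data.empty?
    min = @data[0]
    last = @data.pop
    unless @data.empty?
      @data[0] = last
      sift_down(0)
    end
    min
  end

  private def sift_down(i : Int32) : Nil
    loop do
      first_child = 4 * i + 1
      break if first_child >= @data.size
      smallest = i
      last_child = Math.min(first_child + 3, @data.size - 1)
      (first_child..last_child).each do |c|
        smallest = c if @data[c] < @data[smallest]
      end
      break if smallest == i
      @data.swap(i, smallest)
      i = smallest
    end
  end
end

heap = MinHeap4(Int32).new
[5, 1, 9, 3].each { |n| heap.push(n) }
p heap.pop # => 1
```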

The skiplist is nice: it has excellent performance on delete-min, and arbitrary delete ain't fast but never becomes painfully slow (thanks to the fast lookups), yet inserts are significantly slower than the above minheap because it must allocate, and following nodes is subject to CPU cache thrashing…

I'll try to keep a list of free nodes. That may help with the allocations (but won't help with the cache thrashing) :cry: