New Event Loop (UNIX): call for reviews & tests

Following the changes from RFC 7 (see RFC 0007: Event loop refactor), I started experimenting with epoll and kqueue, which led me to open multiple pull requests that were quickly superseded over the last 2 months.
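For readers unfamiliar with the syscall layer involved, here is a minimal, illustrative readiness loop in Python (using its thin `select.epoll` wrapper; Linux-only, and in no way Crystal's actual implementation): register an fd, block until it is readable, then consume the data. The kqueue backend on BSD/macOS is analogous.

```python
import os
import select

# Create a pipe: the read end only becomes "ready" once data is written.
r, w = os.pipe()

ep = select.epoll()
ep.register(r, select.EPOLLIN)   # watch the read end for readability

os.write(w, b"ping")             # make the read end ready

events = ep.poll(timeout=1.0)    # blocks until readiness (or timeout)
for fd, mask in events:
    if fd == r and mask & select.EPOLLIN:
        print(os.read(r, 4))     # → b'ping'

ep.unregister(r)
ep.close()
os.close(r)
os.close(w)
```

An event loop generalizes this: many fds registered at once, with fibers parked until their fd shows up in the `poll` result.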

The long development has come to an end: the final pull request is ready!

If there are any brave souls out there, you can try to review it (please), but be aware that the changes are significant :sweat_smile: Hopefully there are enough comments and documentation, and the order of commits can help grasp some of the introduced concepts.

It most likely won't make it into 1.14 since we're so close to the release. Even behind a preview flag, it might not make much sense to push something that might have issues in a release. Let's polish it over the development of 1.15 instead.

Alternatively, there's nothing preventing you from testing the branch: check out the branch, recompile your application, and hopefully :tada: … otherwise please report regressions or issues. In the very best scenario (e.g. purely IO-bound with long-lived connections) you might see up to a +20% improvement; at worst it shouldn't be slower than libevent.

8 Likes

Notes:

  1. The changes are significant, yet there are many things I kept for future improvements: a skiplist for timers, integrating signalfd / EVFILT_SIGNAL as well as pidfd / EVFILT_PROC, and more. This is only the "core" of further improvements to the event loop :sweat_smile:

  2. I read up on io_uring and have some preliminary design ideas about how to integrate it into Crystal, but that will have to wait. I'm working on RFC 2 now, trying to take advantage of the new evloop :grin:

1 Like

Thanks a lot @ysbaddaden :heart: I just compiled a 7K LOC Kemal application with this branch and ran some benchmarks with wrk. I didn't see any errors, and the performance is on par with Crystal 1.13.1, maybe 1-2% faster, which is within the error margin.

3 Likes

The magic door has finally opened! Cheers!

From ibidem:

"Most applications shouldn't notice any impact because of this design choice, since an fd is usually not shared across fibers (concurrency issues), except maybe a server socket with multiple accepting fibers? In that case you'll need to make sure the fibers are on the same thread (preview_mt) or the same context (RFC 2)."

Does this mean that under certain conditions -Dpreview_mt is or becomes obsolete?

Yes. A new multithreading environment is coming up, as discussed in RFC 2. The event loop refactor is a prerequisite for that. It'll replace the old -Dpreview_mt.

1 Like

I tested this with the current Benben trunk code, about 71K LOC total, and things compile and run just fine. Performance seems to take a decent hit, though I'm also comparing Crystal 1.12.2 to this PR. I compiled Crystal myself both times.

Rendering 79 VGM files to WAV, four in parallel on four cores:

  • Crystal v1.12.2: rendering took 03:28.087905842 with an avg samples/sec of 906,261
  • This PR: rendering took 03:49.727951144 with an avg samples/sec of 821,117

The machine:

  • Ryzen 3 3200U, 10GB RAM
  • Slackware 15.0
  • Kernel 6.10.11
  • LLVM 13
  • Samsung 970 EVO Plus SSD

4 Likes

Thanks. Yet another proof that reading helps.

@MistressRemilia Damn. I assume Benben works with disk files, right? Meaning that it should never block (disk files are always ready) and thus never enter epoll. Maybe this is an impact of the EventLoop#read and #write methods being slightly different? :thinking:
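To make the "disk files never enter epoll" point concrete, here is a small Python demonstration (Python's `select.epoll` wraps epoll_ctl(2) directly; Linux-only): registering a regular file descriptor fails with EPERM, because the kernel considers regular files always ready and refuses to watch them.

```python
import os
import select
import tempfile

# Regular files are always "ready", so epoll refuses to watch them:
# epoll_ctl(2) returns EPERM for a regular file descriptor. This is why
# a purely disk-file workload should never hit the readiness loop.
fd, path = tempfile.mkstemp()
ep = select.epoll()
try:
    ep.register(fd, select.EPOLLIN)
    result = "registered"      # not expected for a regular file
except PermissionError:        # Python's mapping of errno EPERM
    result = "EPERM"
finally:
    ep.close()
    os.close(fd)
    os.unlink(path)

print(result)  # → EPERM
```

So reads and writes on disk files bypass the event loop entirely; any regression there has to come from somewhere else in the IO path.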

Would you have some reproducible steps? I don't know where to find VGM files, for instance.

1 Like

@MistressRemilia I checked out your code (nice, I learned a bit of Fossil): you only use disk files, no sockets, and I can't pinpoint anything fundamentally different between the two evloops that would explain why it would be slower to read, write or close files (it's always blocking).

Then the next explanation would be the timers: sleeps (obviously) and fiber yields (they create a 0-second timer). You don't use IO timeouts or select with timeouts, but you do use sleep and Fiber.yield.

They're the weak point of this evloop refactor: Crystal::Evented::Timers is just a placeholder for a better data structure; it's thread-unsafe and protected by a spin lock (not even a mutex), which creates a contention point for threads, etc.
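As an illustration of the kind of structure that could replace such a placeholder, here is a hypothetical min-heap timer queue sketched in Python (the `TimerQueue` name and API are invented for this example; this is not Crystal's Crystal::Evented::Timers). The earliest deadline sits at the top of the heap, so scheduling a sleep or a Fiber.yield-style zero-delay timer is O(log n), and finding the next wakeup is O(1):

```python
import heapq
import itertools
import time

class TimerQueue:
    """Illustrative min-heap timer queue (not Crystal's actual code)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal deadlines

    def add(self, delay, callback):
        """Schedule callback to fire `delay` seconds from now."""
        deadline = time.monotonic() + delay
        heapq.heappush(self._heap, (deadline, next(self._seq), callback))

    def next_timeout(self):
        """Seconds until the earliest timer, or None if empty."""
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - time.monotonic())

    def fire_due(self):
        """Pop and run every timer whose deadline has passed."""
        now = time.monotonic()
        fired = []
        while self._heap and self._heap[0][0] <= now:
            _, _, cb = heapq.heappop(self._heap)
            fired.append(cb())
        return fired

q = TimerQueue()
q.add(0, lambda: "yield")   # a Fiber.yield-style zero-delay timer
q.add(60, lambda: "sleep")  # a long sleep; not due yet
print(q.fire_due())         # → ['yield']
```

A heap still needs a lock when shared across threads, of course; the skiplist mentioned in the notes above is one way to reduce that contention.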

Is this to be understood as "it's known, and will change", or as "it's a limitation of the epoll event loop refactor and we can't do much about it"?

1 Like

I'm certain ~we~ he'll fix it in a future PR.

3 Likes

Yeah, it's only writing stuff to disk, no sockets or anything. Specifically, using --render causes it to write rendered audio data to WAV (or Au) files. If you want some VGM files to test with, check VGMRips (maybe the X68000 version of Castlevania, since I use it as part of my benchmarking). Or just test it with FLAC/MP3/Ogg/WavPack/etc.

FWIW, I also checked it against 1.13.3 just now, and the regression doesn't appear in that version. So I think you may be on to something with the Timers.

2 Likes