Following the changes from RFC 7 (see RFC 0007: Event loop refactor), I started experimenting with epoll and kqueue, which led me to open multiple pull requests that quickly got superseded over the last 2 months.
The long development has come to an end: the final pull request is ready!
If there are any brave souls, you can try to review it (please!), but be aware that the changes are significant. Hopefully there are enough comments and documentation, and the order of commits can help to grasp some of the introduced concepts.
It most likely won't make it into 1.14 since we're so close to the release. Even behind a preview flag, it might not make much sense to push something that might have issues into a release. Let's polish it over the development of 1.15 instead.
Alternatively, there's nothing preventing you from testing the branch: check out the branch, recompile your application, and hopefully… otherwise please report regressions or issues. In the very best scenario (e.g. purely IO-bound with long-lived connections) you might see up to a 20% improvement; at worst it shouldn't be slower than libevent.
The changes are significant, yet there are many things I kept for future improvements: a skiplist for timers, integrating signalfd / EVFILT_SIGNAL as well as pidfd / EVFILT_PROC, and more. This is only the “core” of more improvements to the event loop.
I read up on io_uring and have some preliminary design ideas about how to integrate it into Crystal, but that will have to wait. I'm working on RFC 2 now, trying to take advantage of the new evloop.
Thanks a lot @ysbaddaden! Just compiled a 7K LOC Kemal application with this branch. Ran some benchmarks with wrk and didn't see any errors; performance is on par with Crystal 1.13.1, maybe 1%-2% faster, which is within the error margin.
“Most applications shouldn't notice any impact because of this design choice, since a fd is usually not shared across fibers (concurrency issues), except maybe a server socket with multiple accepting fibers? In that case you'll need to make sure the fibers are on the same thread (preview_mt) or same context (RFC 2).”
Does this mean that under certain conditions -Dpreview_mt is or becomes obsolete?
Yes. A new multithreading environment is coming up, as discussed in RFC 2. The event loop refactor is a prerequisite for that. It'll replace the old -Dpreview_mt.
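To make the quoted scenario concrete, here's a hedged sketch of a server socket shared by several accepting fibers under the current `-Dpreview_mt`, pinned to one thread with `same_thread: true` (the port, fiber count, and `handle` helper are made up for illustration):

```crystal
require "socket"

# Made-up handler: echoes which acceptor picked up the connection.
def handle(client : TCPSocket, acceptor : Int32)
  client.puts "handled by acceptor #{acceptor}"
ensure
  client.close
end

server = TCPServer.new("127.0.0.1", 8080)

4.times do |i|
  # same_thread: true keeps every acceptor on the spawning thread,
  # matching the constraint from the quote above.
  spawn(same_thread: true) do
    while client = server.accept?
      # call-form spawn captures the current value of `client`
      spawn(handle(client, i), same_thread: true)
    end
  end
end

sleep # keep the main fiber alive
```

Under RFC 2, the equivalent constraint would be keeping those fibers in the same context.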
I tested this with the current Benben trunk code, about 71K LOC total, and things run/compile just fine. Performance seems to take a decent hit, though I'm also comparing Crystal 1.12.2 to this PR. I compiled Crystal myself both times.
Rendering 79 VGM files to WAV, four in parallel on four cores:
Crystal v1.12.2: rendering took 03:28.087905842 with an avg samples/sec of 906,261
This PR: rendering took 03:49.727951144 with an avg samples/sec of 821,117
@MistressRemilia Damn. I assume Benben is working with disk files, right? Meaning that it should never block (disk files are always ready) and thus never enter epoll. Maybe this is an impact of the EventLoop#read and #write methods, which are slightly different?
Would you have some reproducible steps? I don't know where to find VGM files, for instance.
@MistressRemilia I checked out your code (nice, I learned a bit of Fossil), and you only use disk files, no sockets, and I can't seem to pinpoint anything fundamentally different in the two evloops that would explain why it would be slower to read, write, or close files (it's always blocking).
Then the next explanation would be the timers: sleeps (obviously) and fiber yields (they create a 0-second timer). You don't use IO timeouts or select with timeouts, but you do use sleep and Fiber.yield.
They're the weak point of this evloop refactor: Crystal::Evented::Timers is just a placeholder for a better data structure; it's thread-unsafe and protected by a spin lock (not even a mutex), which creates a contention point for threads, etc.
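For illustration, the shape being described is roughly this (a hedged sketch with made-up names; the real `Crystal::Evented::Timers` differs in detail). The point is that every `sleep` or `Fiber.yield` on every thread has to win the same atomic flag before touching the list:

```crystal
# Hedged sketch of a spin-lock-guarded, sorted timer list.
class TimerList
  def initialize
    @lock = Atomic(Int32).new(0)    # 0 = free, 1 = held
    @timers = Deque(Time::Span).new # absolute wake-up times, kept sorted
  end

  private def sync
    # spin lock: busy-wait on an atomic flag instead of parking the
    # thread like a Mutex would
    while @lock.swap(1) == 1
      # spin
    end
    begin
      yield
    ensure
      @lock.set(0)
    end
  end

  # Called for every sleep, IO timeout, and Fiber.yield (0-second timer).
  def add(wake_at : Time::Span) : Nil
    sync do
      index = @timers.bsearch_index { |t| t >= wake_at } || @timers.size
      @timers.insert(index, wake_at)
    end
  end

  # Pop the earliest timer if it is due, nil otherwise.
  def pop_ready?(now : Time::Span) : Time::Span?
    sync do
      @timers.shift if (first = @timers.first?) && first <= now
    end
  end
end
```

With many threads hammering `add` (e.g. tight `Fiber.yield` loops), they serialize on the spin lock and burn CPU while waiting, which is one plausible source of a regression.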
Is this to be understood as “it's known, and will change”, or as “it's a limitation of the epoll event loop refactor and we can't do much about it”?
Yeah, it's only writing stuff to disk, no sockets or anything. Specifically, using --render will cause it to write rendered audio data to WAV (or Au) files. If you want some VGM files to test, check VGMRips (maybe the X68000 version of Castlevania, since I use it as part of my benchmarking). Or just test it with FLAC/MP3/Ogg/WavPack/etc.
FWIW, I also checked it against 1.13.3 just now and the regression doesn't appear in that version. So I think you may be on to something with the Timers.
The architecture is already designed with io_uring support in mind. It's not implemented yet, but we know what it needs.
In fact, the IOCP-based implementation on Windows already behaves very similarly to io_uring: it schedules tasks and waits for their completion instead of polling for readiness.
@MistressRemilia I can't reproduce the performance hit on my Intel 14700K (Ubuntu 22.04, Linux 6.9, LLVM 18).
I downloaded the VGMs from the “top 5 best rated” on VGMRips (Dragon Saber, OutRun, Battle Garegga, Daytona USA + Akumajo Dracula). I removed the xx Marginal Consciousness (in game loop) file because it took forever to transcode.
I tried with 4 threads & jobs, then 20 (my CPU core count) as well as 28 (my CPU thread count) threads & jobs, but I always get roughly identical timings (time spent & average samples/sec), be it with the Crystal 1.14.0 release or the evloop branch.
I also tried replacing the deque in timers with a minheap, then a skiplist, but to no effect: I still reach the same overall performance in Benben.
What heap implementation? Regular binary heaps are typically not very fast… Not that it probably matters in any way whatsoever here, as there was no difference measured.
Trying to figure out things like this is where it would be nice to have more powerful instrumentation, so that it would be possible to attribute the time cost to different parts of the runtime or program.
@yxhuvud They're actually super fast… well, mostly. Both Go and libevent use minheaps.
I'm using a 4-heap (see d-ary heap). It has flawless performance on insert and great performance on delete-min, but arbitrary delete becomes painfully slow when the heap grows (over 1000 timers) because lookups are slow (linear scans).
The main advantage is that a heap is basically an array, and since a 4-heap considers 4 children (instead of 2), we can take full advantage of CPU caches (not much thrashing).
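In code, the index arithmetic and the trade-offs look roughly like this (a hedged sketch with made-up names, not the actual implementation from the PR):

```crystal
# Hedged sketch of a 4-ary min-heap over timers. Parent of index i is
# (i - 1) // 4; children of i are 4i + 1 through 4i + 4, contiguous in
# the backing array (hence the cache friendliness).
record Timer, wake_at : Time::Span

class MinHeap4
  @heap = [] of Timer

  # Insert: append, then sift up; the fan-out of 4 keeps the tree shallow.
  def push(timer : Timer) : Nil
    @heap << timer
    i = @heap.size - 1
    while i > 0
      parent = (i - 1) // 4
      break if @heap[parent].wake_at <= @heap[i].wake_at
      @heap.swap(i, parent)
      i = parent
    end
  end

  # Delete-min: move the last element to the root and sift down,
  # scanning 4 adjacent children at a time.
  def pop? : Timer?
    return nil if @heap.empty?
    min = @heap[0]
    last = @heap.pop
    unless @heap.empty?
      @heap[0] = last
      i = 0
      while (first_child = 4 * i + 1) < @heap.size
        smallest = i
        first_child.upto(Math.min(first_child + 3, @heap.size - 1)) do |c|
          smallest = c if @heap[c].wake_at < @heap[smallest].wake_at
        end
        break if smallest == i
        @heap.swap(i, smallest)
        i = smallest
      end
    end
    min
  end

  # Arbitrary delete (e.g. a canceled timeout) is the weak spot:
  # locating the entry is a linear scan over the whole array.
  def delete(timer : Timer) : Bool
    return false unless i = @heap.index(timer)
    @heap.swap(i, @heap.size - 1)
    @heap.pop
    # a complete implementation would now sift @heap[i] up or down
    true
  end
end
```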
The skiplist is nice: it has excellent performance on delete-min, and arbitrary delete ain't fast but never becomes painfully slow (thanks to the fast lookups), yet inserts are significantly slower than the above minheap because it must allocate, and then following nodes is subject to CPU cache thrashing…
I'll try to keep a list of free nodes. That may help with the allocations (but won't help with the cache thrashing).
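For what it's worth, the free list could look roughly like this (a hedged sketch; `Node` and `NodePool` are made-up stand-ins for the real skiplist node type):

```crystal
# Hedged sketch: recycle skiplist nodes through a free list instead of
# allocating a new node on every insert.
class Node
  property wake_at : Time::Span = Time::Span.zero
  property next_free : Node? = nil
end

class NodePool
  @free : Node? = nil

  # Reuse a recycled node when available; allocate only as a fallback.
  def acquire : Node
    if node = @free
      @free = node.next_free
      node.next_free = nil
      node
    else
      Node.new
    end
  end

  # On delete, hand the node back instead of leaving it to the GC.
  # This helps the allocation cost, but recycled nodes stay scattered
  # in memory, so it does nothing for the cache thrashing.
  def release(node : Node) : Nil
    node.next_free = @free
    @free = node
  end
end
```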