While Gunnar Morling originally posed The One Billion Row Challenge for Java developers, over time folks have started implementing it in different languages which are showcased here.
When reviewing the results in Java I ran across one implementation in C that made me wonder what I could do with Crystal.
My implementation in Crystal is here: GitHub - nogginly/1brc.cr: Implementation of the "One Billion Rows Challenge" using Crystal.
I’ve been inspired by the many contributions at 1BRC and over time have incorporated several ideas. The following is a list of the different ways in which I’ve been able to speed up the Crystal implementation (1brc_parallel.cr in the repo):
- unrolling parsing for the temperature
- chunking the parsing into parallel work
- using Crystal’s fibers and enabling multithreading
- using operators that ignore overflow checking
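For illustration, here’s a minimal sketch (not the repo’s actual code; names are made up) of what unrolled temperature parsing with overflow-ignoring operators can look like, assuming readings are always `d.d` or `dd.d` with an optional leading minus:

```crystal
# Parse a reading like "-12.3" into tenths of a degree (Int32), unrolled
# instead of using String#to_f. Returns {value, index after the reading}.
# &+ / &- / &* skip overflow checks, matching the bullet above.
def parse_temp(bytes : Bytes, i : Int32) : {Int32, Int32}
  neg = bytes[i] == '-'.ord
  i &+= 1 if neg
  if bytes[i &+ 1] == '.'.ord # "d.d"
    v = (bytes[i].to_i &- 48) &* 10 &+ (bytes[i &+ 2].to_i &- 48)
    i &+= 3
  else                        # "dd.d"
    v = (bytes[i].to_i &- 48) &* 100 &+ (bytes[i &+ 1].to_i &- 48) &* 10 &+ (bytes[i &+ 3].to_i &- 48)
    i &+= 4
  end
  {neg ? 0 &- v : v, i}
end

puts parse_temp("-12.3".to_slice, 0) # value in tenths, plus next index
```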
I did not, however, do the following things that a lot of show & tell contributions did:
- did not use memory mapping for the file, since (a) Crystal limits arrays to Int32 indices and (b) the most memory I have is 16GB (see update below)
- did not use vector processing or any other CPU-specific optimization
Regardless, to Crystal’s credit, my parallel implementation runs faster than some of the top contenders when I compared them on my laptops.
I’d love to know if there’s something more I can do to improve the performance. I’ve included my initial serial implementations as well.
Update, Feb 2
I’ve added a pointer-based variant called 1brc_parallel_ptr which runs 5% faster than 1brc_parallel.
Update, Feb 4
I added page-aligned buffer variants (1brc_parallel_ptr2) and on a whim decided to benchmark them against the prior forms. The variants include a change to chunk sizing: all but the last chunk have the same size, and the last chunk is smaller (the remainder of the division). These changes improved the overall performance.
Update, Feb 5
I’ve added a version that uses mmap for loading the file, which turns out to improve performance on Linux but not so much on macOS.
On a PC with an AMD Ryzen 7 7735HS CPU, 16 cores, 32 GB RAM, running Linux, and comparing with some other 1brc contenders, there’s still room to improve. See the TODO list for changes so far and possible further improvements.
Benchmarking 1brc_parallel_mmap 64 8, relative performance was as follows:
1.09 ± 0.01 times faster than merykittyunsafe
1.20 ± 0.02 times faster than merykitty
2.54 ± 0.04 times faster than 1brc_parallel_mmap 64 8
Update, Feb 7
- Figured out why macOS was performing poorly with mmap … memory! With only 16GB of RAM, memory-mapping a 13GB file (1B rows) was causing problems (likely swapping).
- I switched to a 500M-row (6.5GB) file, which is large enough to be comparable while small enough to fit … and voila! The mmap variant outperforms all others on macOS.
- In addition, I implemented a specialized FxHashMap (inspired by the merykitty implementation) after digging into the FxHash function (details here) in the Rust compiler.
- In the repo you will find three b variants which all use the new FxHashMap; in the TODO you can see the comparison where the b variants beat all their predecessors.
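As an aside, the core of FxHash is tiny. Here’s a Crystal transliteration of its 64-bit mixing step (my own sketch based on the Rust compiler’s version, not necessarily how the repo’s FxHashMap does it):

```crystal
# FxHash 64-bit step: rotate left by 5, XOR in the next word, then a
# wrapping multiply by the seed constant from the Rust compiler's FxHash.
FX_SEED = 0x517cc1b727220a95_u64

def fx_hash_step(hash : UInt64, word : UInt64) : UInt64
  rotated = (hash << 5) | (hash >> 59) # rotate_left(5)
  (rotated ^ word) &* FX_SEED          # &* ignores overflow, as FxHash needs
end

# Fold a name's bytes 8 at a time (tail handling omitted for brevity).
def fx_hash(words : Array(UInt64)) : UInt64
  words.reduce(0_u64) { |h, w| fx_hash_step(h, w) }
end

puts fx_hash([0x6f6c6c6548_u64]) # hash of one 8-byte word
```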
Relative performance is as follows, with my implementation now only 1.87x slower (vs 2.58x as of two days ago).
1.02 ± 0.13 times faster than serkan-ozal
1.18 ± 0.03 times faster than merykittyunsafe
1.87 ± 0.01 times faster than 1brc_parallel_mmap2b 32 24 <----
2.58 ± 0.05 times faster than 1brc_parallel_mmap2 32 24
Wow, this is awesome. I wanted to do it myself but had no time. Thanks a lot!
Overcommitting on threads seems very odd. I would expect optimal saturation when the number of Crystal threads equals the number of hardware threads, so the OS doesn’t need to shuffle and all threads can power through.
The number of fibers can be higher to take advantage of waiting time while reading from disk. Since all fibers are reading from the same file, I wouldn’t overdo it though. You could decrease the number of fibers and have each fiber run a loop that tackles the next piece of work when it has finished one task. This should be more efficient than massively overcommitting on fibers but limiting their concurrency by passing buffers forward.
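For what it’s worth, a sketch of that pattern (worker count, chunk sizes, and names are all made up) using a Channel as the work queue:

```crystal
# A fixed pool of worker fibers, each looping over a channel of chunk
# descriptors, instead of spawning one fiber per chunk.
# Compile with -Dpreview_mt -Drelease to get actual parallelism.
chunks  = Channel({Int64, Int32}).new(64) # {offset, size} pairs
results = Channel(Nil).new

WORKERS = 8
WORKERS.times do
  spawn do
    while chunk = chunks.receive? # nil once the channel is closed
      offset, size = chunk
      # read + parse the chunk here ...
    end
    results.send(nil)
  end
end

# Producer: enqueue chunk descriptors, then close the channel.
file_size  = 1_000_000_i64
chunk_size = 65536
offset = 0_i64
while offset < file_size
  size = Math.min(chunk_size.to_i64, file_size - offset).to_i32
  chunks.send({offset, size})
  offset += size
end
chunks.close
WORKERS.times { results.receive } # wait for all workers to drain
```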
Memory mapping might indeed offer quite a substantial performance increase.
What’s the problem with that? You don’t need to map the entire file at once. You can load individual slices, each up to Int32::MAX bytes.
A smaller suggestion which may or may not have much impact:
Slice#[] has an implicit range check. If you are sure the index is valid, you can use pointer arithmetic directly to avoid the check. This is similar to the overflowing operators.
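A tiny example of the difference, if I understand the suggestion correctly:

```crystal
buf = Bytes[72_u8, 105_u8]

a = buf[1]        # Slice#[]: bounds-checked at runtime
p = buf.to_unsafe # raw Pointer(UInt8), no checks
b = p[1]          # you must guarantee 1 < buf.size yourself

# same spirit as the overflow-ignoring operators:
x = 255_u8 &+ 1_u8 # wraps to 0 instead of raising
puts({a, b, x})    # => {105, 105, 0}
```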
Right, good point. I “assumed” I had to map it all and didn’t actually look into how it worked.
And there’s so much good stuff in your overall note, I’m going to start a TODO list in the repo and start making new versions.
It does feel odd, and I remember the exact moment when I put in a high thread count as a lark while trying to figure out how to get higher multi-core utilization, and couldn’t believe the result. There’s some pipelining effect happening because of the pattern of “IO burst followed by long pure in-memory compute” that seems to favour queuing a lot of work. When I have some more time I’ll dig into it. In the meantime I’ve put up more data (for the i7, running on battery) showing the way performance responds to thread count and chunk count (i.e. buffer division) here in the repo.
The current fastest implementation is in C# from what I know. Have you had a chance to run that on your laptop?
I have not run that. I have been using the Java submissions merykitty and merykitty_unsafe by Quan Anh Mai,
- which per Gunnar run in 03.210s and 02.367s respectively on his benchmarking system,
- and run in 15.149s and 14.917s on my M1 MacBook.
Based on that, my implementation, running in 8.646s on the M1, is about 40% faster.
It’s here if you wanna try. The author also made an extended input, and I believe is keeping a leaderboard that you could add yours to.
What? No. Reading from disk is a blocking operation and will block the thread until done. To get around that you need to use either a thread pool with more threads, or io_uring. That is assuming Linux is used. I’m not aware of what possibilities there are for reading files asynchronously on macOS or Windows.
(If we want to deep-dive into implementation specifics: the implementation uses wait_readable for disk IO as well, but files are always readable; it is the actual reading that blocks and takes time. This is a big difference from how network sockets work. It is also one of the main reasons it would be neat to have an io_uring-based event machine.)
Oh yeah, you’re right. File IO is indeed blocking (by default; most of the time). I was misled by the mindset of sockets.
- Since an IO read blocks its thread, it makes sense that I’m seeing better performance by running so many fibers relative to the threads; is that a reasonable conclusion?
- I’ve been playing with mmap using a test app and was surprised to find it didn’t help on macOS. So I switched to a Linux machine, ran my test programs, and … drum roll … mmap won hands down.
- Essentially, using mmap is a win for Linux and not so much for macOS in the situation where I’m chunking one big file and scanning it concurrently.
I hope to have an mmap-based version of my Crystal 1brc app soon.
Do you have any good links for using mmap from Crystal? I’ve been curious for a while, but haven’t found anything good.
After some research (aka Duck Duck Fu) I discovered that Crystal’s stdlib already has definitions for the mmap API under LibC. So, using man mmap as a reference to the API, I was able to get the following very simple program opening a file and loading it via mmap:
file = File.new(ARGV.first, "r")
# mmap 1K of the file starting at offset 0
ptr = LibC.mmap(nil, 1024, LibC::PROT_READ, LibC::MAP_PRIVATE, file.fd, 0)
byteptr = Pointer(UInt8).new(ptr.address)       # cast to byte pointer
buf = Bytes.new(byteptr, 1024, read_only: true) # wrap with a Slice(UInt8)
# do stuff: print out the 1K
print String.new(buf)
# clean up the mapping
LibC.munmap(ptr, 1024)
file.close
This is just to illustrate how to use it. When you compile and run this with a file argument, it will
- open the file
- mmap the first 1K of the file to a pointer
- cast and wrap it into a Bytes
- print out the 1K
- clean up the mapping
- I used PROT_READ, but if you want to use mmap to then modify the file you’ll need to OR in PROT_WRITE as well.
- I used MAP_PRIVATE for exclusive use, but if you want to share etc., there are other MAP_* options in the man page that are also available.
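Putting those two notes together, a hypothetical writable mapping (untested sketch; the file name is made up) would look like:

```crystal
# OR in PROT_WRITE, and use MAP_SHARED so changes are written back to the
# file rather than kept in a private copy (see man mmap).
file = File.new("data.bin", "r+")
ptr = LibC.mmap(nil, 1024, LibC::PROT_READ | LibC::PROT_WRITE,
                LibC::MAP_SHARED, file.fd, 0)
bytes = Bytes.new(Pointer(UInt8).new(ptr.address), 1024)
bytes[0] = 0_u8 # modifies the mapped file
LibC.munmap(ptr, 1024)
file.close
```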
Posted the latest update after adding an mmap-based implementation. Turns out mmap improves performance on Linux, but not so much on macOS. Since the challenge’s target was Linux, I’ll continue to test and post results for that OS.
As you can see from my comparisons, I’m still not doing as well as I would like. The merykitty_unsafe Java app running on OpenJDK 21 is beating my app hands down. To help figure out what was taking up time, I commented out the code where I use a Hash to store the stats and look them up by name.
With Hash map use commented out …
% time ./run.sh 1brc_parallel_mmap 64 8
./run.sh 1brc_parallel_mmap 64 8 17.58s user 0.66s system 1276% cpu 1.429 total
With Hash map use restored …
% time ./run.sh 1brc_parallel_mmap 64 8
./run.sh 1brc_parallel_mmap 64 8 45.62s user 0.76s system 1443% cpu 3.214 total
That’s ~2.24x more time when storing the data, and an obvious candidate for my next attempts.