While Gunnar Morling originally posed The One Billion Row Challenge for Java developers, over time folks have started implementing it in different languages which are showcased here.
When reviewing the results in Java I ran across one implementation in C that made me wonder what I could do with Crystal.
My implementation in Crystal is here: GitHub - nogginly/1brc.cr: Implementation of the "One Billion Rows Challenge" using Crystal.
I’ve been inspired by the many contributions at 1BRC and over time have incorporated several ideas. The following is a list of the different ways in which I’ve been able to speed up the Crystal implementation (1brc_parallel.cr in the repo):
- unrolling parsing for the temperature
- chunking the parsing into parallel work
- using Crystal’s fibers and enabling multithreading
- using operators that ignore overflow checking
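For illustration, here’s a minimal sketch (not the repo’s actual code; names are made up) of what unrolled temperature parsing with overflow-ignoring operators can look like, assuming readings are always `d.d` or `dd.d` with an optional leading minus:

```crystal
# Parse a reading like "-12.3" into tenths of a degree (Int32), unrolled
# instead of using String#to_f. Returns {value, index after the reading}.
# &+ / &- / &* skip overflow checks, matching the bullet above.
def parse_temp(bytes : Bytes, i : Int32) : {Int32, Int32}
  neg = bytes[i] == '-'.ord
  i &+= 1 if neg
  if bytes[i &+ 1] == '.'.ord # "d.d"
    v = (bytes[i].to_i &- 48) &* 10 &+ (bytes[i &+ 2].to_i &- 48)
    i &+= 3
  else                        # "dd.d"
    v = (bytes[i].to_i &- 48) &* 100 &+ (bytes[i &+ 1].to_i &- 48) &* 10 &+ (bytes[i &+ 3].to_i &- 48)
    i &+= 4
  end
  {neg ? 0 &- v : v, i}
end

puts parse_temp("-12.3".to_slice, 0) # value in tenths, plus next index
```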
I did not, however, do the following things that a lot of show & tell contributions did:
- did not use memory mapping for the file, since (a) Crystal limits arrays to Int32 indices and (b) the most memory I have is 16GB (see update below)
- did not use vector processing or any other CPU-specific optimization
Regardless, to Crystal’s credit, my parallel implementation runs faster than some of the top contenders when I compared them on my laptops.
I’d love to know if there’s something more I can do to improve the performance. I’ve included my initial serial implementations as well.
Update, Feb 2
I’ve added a pointer-based variant called 1brc_parallel_ptr which runs 5% faster than 1brc_parallel.
Update, Feb 4
I added page-aligned buffer variants (1brc_parallel_ptr2) and on a whim decided to benchmark them against the prior forms. The variants include a change to chunk sizing: all but the last chunk have the same size, and the last chunk is smaller (the remainder of the division). These changes improved the overall performance.
Update, Feb 5
I’ve added a version that uses mmap for loading the file, which turns out to improve performance on Linux but not so much on macOS.
On a PC with an AMD Ryzen 7 7735HS CPU, 16 cores, 32 GB RAM, running Linux, and comparing with some other 1brc contenders, there’s still room to improve. See the TODO list for changes so far and possible further improvements.
Benchmarking 1brc_parallel_mmap 64 8, relative performance was as follows:
1.09 ± 0.01 times faster than merykittyunsafe
1.20 ± 0.02 times faster than merykitty
2.54 ± 0.04 times faster than 1brc_parallel_mmap 64 8
Update, Feb 7
- Figured out why macOS was performing poorly with mmap … memory! With only 16GB of RAM, memory-mapping a 13GB file (1B rows) was causing problems (likely swapping).
- I switched to a 500M-row (6.5GB) file, which is large enough to be comparable while small enough to fit … and voila! The mmap variant outperforms all others on macOS.
- In addition, I implemented a specialized FxHashMap (inspired by the merykitty implementation) after digging into the FxHash function (details here) in the Rust compiler.
- In the repo you will find three b variants which all use the new FxHashMap; in the TODO you can see the comparison where the b variants beat all their predecessors.
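As an aside, the core of FxHash is tiny. Here’s a Crystal transliteration of its 64-bit mixing step (my own sketch based on the Rust compiler’s version, not necessarily how the repo’s FxHashMap does it):

```crystal
# FxHash 64-bit step: rotate left by 5, XOR in the next word, then a
# wrapping multiply by the seed constant from the Rust compiler's FxHash.
FX_SEED = 0x517cc1b727220a95_u64

def fx_hash_step(hash : UInt64, word : UInt64) : UInt64
  rotated = (hash << 5) | (hash >> 59) # rotate_left(5)
  (rotated ^ word) &* FX_SEED          # &* ignores overflow, as FxHash needs
end

# Fold a name's bytes 8 at a time (tail handling omitted for brevity).
def fx_hash(words : Array(UInt64)) : UInt64
  words.reduce(0_u64) { |h, w| fx_hash_step(h, w) }
end

puts fx_hash([0x6f6c6c6548_u64]) # hash of one 8-byte word
```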
Relative performance is as follows, with my implementation now only 1.87x slower (vs 2.58x as of two days ago).
1.02 ± 0.13 times faster than serkan-ozal
1.18 ± 0.03 times faster than merykittyunsafe
1.87 ± 0.01 times faster than 1brc_parallel_mmap2b 32 24 <----
2.58 ± 0.05 times faster than 1brc_parallel_mmap2 32 24
Wow, this is awesome. I wanted to do it myself but had no time. Thanks a lot!
Overcommitting on threads seems very odd. I would expect optimal saturation when the number of Crystal threads equals the number of hardware threads, so the OS doesn’t need to shuffle and all threads can power through.
The number of fibers can be higher to take advantage of waiting time while reading from disk. Since all fibers are reading from the same file, I wouldn’t overdo it though. You could decrease the number of fibers and have each fiber run a loop that tackles the next piece of work when it has finished one task. This should be more efficient than massively overcommitting on fibers but limiting their concurrency by passing buffers forward.
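For what it’s worth, a sketch of that pattern (worker count, chunk sizes, and names are all made up) using a Channel as the work queue:

```crystal
# A fixed pool of worker fibers, each looping over a channel of chunk
# descriptors, instead of spawning one fiber per chunk.
# Compile with -Dpreview_mt -Drelease to get actual parallelism.
chunks  = Channel({Int64, Int32}).new(64) # {offset, size} pairs
results = Channel(Nil).new

WORKERS = 8
WORKERS.times do
  spawn do
    while chunk = chunks.receive? # nil once the channel is closed
      offset, size = chunk
      # read + parse the chunk here ...
    end
    results.send(nil)
  end
end

# Producer: enqueue chunk descriptors, then close the channel.
file_size  = 1_000_000_i64
chunk_size = 65536
offset = 0_i64
while offset < file_size
  size = Math.min(chunk_size.to_i64, file_size - offset).to_i32
  chunks.send({offset, size})
  offset += size
end
chunks.close
WORKERS.times { results.receive } # wait for all workers to drain
```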
Memory mapping might indeed offer quite a substantial performance increase.
What’s the problem with that? You don’t need to map the entire file at once. You can load individual slices, each up to Int32::MAX bytes.
A smaller suggestion which may or may not have much impact:
Slice#[] has an implicit range check. If you are sure the index is valid, you can use pointer arithmetic directly to avoid the check. This is similar to the overflowing operators.
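A tiny example of the difference, if I understand the suggestion correctly:

```crystal
buf = Bytes[72_u8, 105_u8]

a = buf[1]        # Slice#[]: bounds-checked at runtime
p = buf.to_unsafe # raw Pointer(UInt8), no checks
b = p[1]          # you must guarantee 1 < buf.size yourself

# same spirit as the overflow-ignoring operators:
x = 255_u8 &+ 1_u8 # wraps to 0 instead of raising
puts({a, b, x})    # => {105, 105, 0}
```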
Right, good point. I “assumed” I had to map it all and didn’t actually look into how it worked.
And there’s so much good stuff in your overall note, I’m going to start a TODO list in the repo and start making new versions.
It does feel odd, and I remember the exact moment when I put in a high thread count as a lark while trying to figure out how to get higher multi-core utilization, and couldn’t believe the result. There’s some pipelining effect happening because of the pattern of “IO burst followed by long pure in-memory compute” that seems to favour queuing a lot of work. When I have some more time I’ll dig into it. In the meantime I’ve put up more data (for the i7, running on battery) showing the way performance responds to thread count and chunk count (i.e. buffer division) here in the repo.
The current fastest implementation is in C# from what I know. Have you had a chance to run that on your laptop?
I have not run that. I have been using the Java submissions merykitty and merykitty_unsafe by Quan Anh Mai,
- which per Gunnar run in 03.210s and 02.367s respectively on his benchmarking system,
- and run in 15.149s and 14.917s on my M1 MacBook.
Based on that, my implementation, running in 8.646s on the M1, is about 40% faster.
It’s here if you wanna try. The author also made an extended input, and I believe is keeping a leaderboard that you could add yours to.
What? No. Reading from disk is a blocking operation and will block the thread until done. To get around that you need to use either a thread pool with more threads, or io_uring. That is assuming Linux is used. I’m not aware of what possibilities there are for reading files asynchronously on macOS or Windows.
(If we want to deep-dive into implementation specifics: the implementation uses wait_readable for disk IO as well, but files are always readable; it is the actual reading that blocks and takes time. This is a big difference from how network sockets work. It is also one of the main reasons it would be neat to have an io_uring-based event machine.)
Oh yeah, you’re right. File IO is indeed blocking (by default; most of the time). I was misled by the mindset of sockets.
- Since an IO read blocks its thread, it makes sense that I’m seeing better performance by running so many fibers relative to the threads; is that a reasonable conclusion?
- I’ve been playing with mmap using a test app and was surprised to find it didn’t help on macOS. So I switched to a Linux machine, ran my test programs, and … drum roll … mmap won hands down.
- Essentially, using mmap is a win for Linux and not so much for macOS in the situation where I’m chunking one big file and scanning it concurrently.
I hope to have an mmap-based version of my Crystal 1brc app soon.
Do you have any good links for using mmap from Crystal? I’ve been curious for a while, but haven’t found anything good.
After some research (aka Duck Duck Fu) I discovered that Crystal’s stdlib already has definitions for the mmap API under LibC. So, using man mmap as a reference to the API, I was able to get the following very simple program opening a file and loading it via mmap:
file = File.new(ARGV.first, "r")
# mmap 1K of the file starting at offset 0
ptr = LibC.mmap(nil, 1024, LibC::PROT_READ, LibC::MAP_PRIVATE, file.fd, 0)
byteptr = Pointer(UInt8).new(ptr.address)       # cast to byte pointer
buf = Bytes.new(byteptr, 1024, read_only: true) # wrap with a Slice(UInt8)
# do stuff: print out the 1K
print String.new(buf)
# clean up the mapping
LibC.munmap(ptr, 1024)
file.close
This is just to illustrate how to use it. When you compile and run this with a file argument, it will
- open the file
- mmap the first 1K of the file to a pointer
- cast and wrap it into a Bytes
- print out the 1K
- clean up the mapping
- I used PROT_READ, but if you want to use mmap to then modify the file you’ll need to OR in PROT_WRITE as well.
- I used MAP_PRIVATE for exclusive use, but if you want to share etc., there are other MAP_* options in the man page that are also available.
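Putting those two notes together, a hypothetical writable mapping (untested sketch; the file name is made up) would look like:

```crystal
# OR in PROT_WRITE, and use MAP_SHARED so changes are written back to the
# file rather than kept in a private copy (see man mmap).
file = File.new("data.bin", "r+")
ptr = LibC.mmap(nil, 1024, LibC::PROT_READ | LibC::PROT_WRITE,
                LibC::MAP_SHARED, file.fd, 0)
bytes = Bytes.new(Pointer(UInt8).new(ptr.address), 1024)
bytes[0] = 0_u8 # modifies the mapped file
LibC.munmap(ptr, 1024)
file.close
```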
Posted the latest update after adding an mmap-based implementation. Turns out mmap improves performance on Linux, but not so much on macOS. Since the challenge’s target was Linux, I’ll continue to test and post results for that OS.
As you can see from my comparisons, I’m still not doing as well as I would like. The merykitty_unsafe Java app running on OpenJDK 21 is beating my app hands down. To help figure out what was taking up time, I commented out the code where I use a Hash to store the stats and look them up by name.
With Hash map use commented out …
% time ./run.sh 1brc_parallel_mmap 64 8
./run.sh 1brc_parallel_mmap 64 8 17.58s user 0.66s system 1276% cpu 1.429 total
With Hash map use restored …
% time ./run.sh 1brc_parallel_mmap 64 8
./run.sh 1brc_parallel_mmap 64 8 45.62s user 0.76s system 1443% cpu 3.214 total
That’s ~2.24x more time when storing the data, and an obvious candidate for my next attempts.