1BRC in Crystal

While Gunnar Morling originally posed The One Billion Row Challenge for Java developers, over time folks have started implementing it in different languages which are showcased here.

When reviewing the results in Java I ran across one implementation in C that made me wonder what I could do with Crystal.

My implementation in Crystal is here: GitHub - nogginly/1brc.cr: Implementation of the "One Billion Rows Challenge" using Crystal.

I’ve been inspired by the many contributions at 1BRC and over time have incorporated several ideas. The following is a list of the different ways in which I’ve been able to speed up the Crystal implementation (1brc_parallel.cr in the repo):

  • unrolling parsing for the temperature
  • chunking the parsing into parallel work
  • using Crystal’s fibers and enabling multithreading
  • using operators that ignore overflow checking (a sketch combining this with the unrolled temperature parse follows this list)
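
To illustrate the first and last items, here is a minimal sketch of an unrolled temperature parse using the overflow-ignoring operators; the method name and calling convention are mine, not the repo’s:

# Temperatures in the 1BRC input always match -?\d?\d\.\d, so the value can be
# decoded as tenths of a degree with a few branches instead of String#to_f.
# The wrapping operators (&+, &-, &*) skip Crystal's overflow checks.
def parse_temp(buf : Bytes, i : Int32) : {Int32, Int32}
  neg = buf[i] == '-'.ord
  i += 1 if neg

  value = buf[i].to_i32 &- 48                   # first digit ('0' is 48)
  i += 1
  if buf[i] != '.'.ord                          # optional second integer digit
    value = value &* 10 &+ (buf[i].to_i32 &- 48)
    i += 1
  end
  i += 1                                        # skip '.'
  value = value &* 10 &+ (buf[i].to_i32 &- 48)  # fractional digit
  i += 2                                        # move past the digit and '\n'

  {neg ? -value : value, i}
end

Returning tenths of a degree keeps the per-station min/max/sum aggregation in integer math.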

I did not, however, do the following things, which a lot of the show & tell contributions did:

  • did not use memory mapping for the file since (a) Crystal limits arrays to Int32 indices, (b) the most memory I have is 16GB (see update below)
  • did not use vector processing or any other CPU-specific optimization

Regardless, to Crystal’s credit, my parallel implementation runs faster than some of the top contenders when I compared them on my laptops.

I’d love to know if there’s something more I can do to improve the performance. I’ve included my initial serial implementations as well.

Update, Feb 2

I’ve added a Pointer-based variant, 1brc_parallel_ptr, which runs 5% faster than the Bytes-based 1brc_parallel.

Update, Feb 4

I added page-aligned buffer variants (1brc_parallel2 and 1brc_parallel_ptr2) and on a whim decided to benchmark them against the prior forms. The variants also change the chunk sizing: all but the last chunk have the same size, and the last chunk is smaller (the remainder of a division); a sketch of the chunk math follows below.

These changes improved the overall performance.
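
For reference, here is a sketch of that chunk math; the helper name and the page-size constant are illustrative, not lifted from the repo, and a real implementation additionally has to account for rows that straddle a chunk boundary:

PAGE_SIZE = 16_384_i64 # assumed page size; a real program would ask the OS

# All chunks but the last get the same page-aligned size; the last chunk is
# whatever remains after the integer division.
def chunk_bounds(file_size : Int64, chunk_size : Int64) : Array({Int64, Int64})
  size = chunk_size // PAGE_SIZE * PAGE_SIZE # round the chunk size down to a page multiple
  full = file_size // size
  rem  = file_size % size

  bounds = Array({Int64, Int64}).new
  full.times { |i| bounds << {i * size, size} }
  bounds << {full * size, rem} if rem > 0
  bounds
end

# e.g. chunk_bounds(File.size("measurements.txt").to_i64, 4_i64 * 1024 * 1024)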

Update, Feb 5

  • I’ve added a version that uses mmap for loading the file, which turns out to improve performance on Linux but not so much on macOS.

  • On a PC with an AMD Ryzen 7 7735HS CPU, 16 cores, 32 GB, and running Linux, comparing with some other 1brc contenders there’s still room to improve. See the TODO list for changes so far and possible further improvements.

    command                  Lang     mean (s)       stddev (s)
    merykittyunsafe          Java     1.38626720956  0.01020137999217643
    merykitty                Java     1.52302498056  0.024776606613417875
    dannyvankooten/analyze   C        1.26986208056  0.0027637210575434737
    1brc_parallel_mmap 64 8  Crystal  3.22288731556  0.047787448027210376

    Relative performance was as follows:

      dannyvankooten/analyze ran
        1.09 ± 0.01 times faster than merykittyunsafe
        1.20 ± 0.02 times faster than merykitty
        2.54 ± 0.04 times faster than 1brc_parallel_mmap 64 8
    

Update, Feb 7

  • Figured out why macOS was performing poorly with mmap … memory! With only 16GB of RAM, memory mapping a 13GB file (1B rows) was causing problems (likely swapping).
  • I switched to a 500 million row (6.5GB) file, which is large enough to be comparable while small enough to fit … and voila! The mmap variant outperforms all others on macOS.
  • In addition, I implemented a specialized FxHashMap (inspired by the merykitty implementation) after digging into the FxHash function (details here) in the Rust compiler; a sketch of the mixing step follows this list.
  • In the repo you will find three b variants which all use the new FxHashMap; in the TODO you can see the comparison where the b variants beat all their predecessors.
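
For context, the FxHash mixing step itself is tiny. Here is a sketch of the 64-bit version as I understand it from rustc’s FxHasher; the function name is mine and the details may differ from the repo’s FxHashMap:

# One step of the 64-bit FxHash mix: rotate the running hash left by 5,
# XOR in the next word, then wrapping-multiply by the seed constant.
FX_SEED = 0x517cc1b727220a95_u64

def fx_mix(hash : UInt64, word : UInt64) : UInt64
  rotated = (hash << 5) | (hash >> 59)
  (rotated ^ word) &* FX_SEED
end

A station name can then be hashed by folding its bytes into 64-bit words and applying fx_mix once per word, which pairs nicely with word-at-a-time parsing.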

Relative performance is now as follows, with my implementation only 1.87x slower (vs 2.58x as of two days ago) :slight_smile:

  dannyvankooten/analyze ran
    1.02 ± 0.13 times faster than serkan-ozal
    1.18 ± 0.03 times faster than merykittyunsafe
    1.87 ± 0.01 times faster than 1brc_parallel_mmap2b 32 24 <----
    2.58 ± 0.05 times faster than 1brc_parallel_mmap2 32 24
11 Likes

Wow this is awesome, I wanted to do it myself but no time. Thanks a lot!

2 Likes

Overcommitting on threads seems very odd. I would expect optimal saturation when the number of Crystal threads equals the number of hardware threads, so the OS doesn’t need to shuffle and all threads can power through.
The number of fibers can be higher to take advantage of waiting time while reading from disk. Since all fibers are reading from the same file, I wouldn’t overdo it, though. You could decrease the number of fibers and have each fiber run a loop that tackles the next piece of work when it has finished one task. This should be more efficient than massively overcommitting on fibers but limiting their concurrency by passing buffers forward.
Memory mapping might indeed offer quite a substantial performance increase.
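
A minimal sketch of that loop-per-fiber pattern, with illustrative names (Chunk, the channels, the worker count); compiled with -Dpreview_mt, the fibers are spread across CRYSTAL_WORKERS threads:

record Chunk, offset : Int64, size : Int64

chunks = Channel(Chunk).new(16) # bounded queue of pending work
done   = Channel(Nil).new

worker_count = 8 # e.g. one fiber per hardware thread

worker_count.times do
  spawn do
    # Each fiber loops, pulling the next chunk as soon as it finishes one.
    while chunk = chunks.receive?
      # read chunk.size bytes at chunk.offset and parse them here
    end
    done.send(nil)
  end
end

# Producer: send a Chunk for every piece of the file, then close the
# channel so the worker loops exit, and wait for them to finish.
chunks.close
worker_count.times { done.receive }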

What’s the problem with that? You don’t need to map the entire file at once. You can load individual slices, each up to Int32::MAX (or part_size).

A smaller suggestion which might have some impact, or not that much: Slice#[] has an implicit range check. If you are sure the index is valid, you can use pointer arithmetic directly to avoid the check. This is similar to the overflowing operators.
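
For example (a throwaway sketch, not taken from the repo):

buf = Bytes.new(1024)

checked   = buf[10]              # Slice#[]: bounds-checked, can raise IndexError
via_ptr   = buf.to_unsafe[10]    # raw pointer arithmetic, no bounds check
unchecked = buf.unsafe_fetch(10) # Indexable#unsafe_fetch also skips the check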

1 Like

Right, good point. I “assumed” I had to map it all and didn’t actually look into how it worked.

And there’s so much good stuff in your overall note, I’m going to start a TODO list in the repo and start making new versions. :pray:

It does feel odd, and I remember the exact moment when I put in a high thread count as a lark while trying to figure out how to get higher multi-core utilization, and couldn’t believe the result. There’s some pipelining effect happening because of the pattern of "IO burst followed by long pure in-memory compute" that seems to favour queuing a lot of work. When I have some more time I’ll dig into it. In the meantime I’ve put up more data (for the i7, running on battery) showing how performance responds to thread count and chunk count (i.e. buffer division) here in the repo.

1 Like

The current fastest implementation is in C# from what I know. Have you had a chance to run that on your laptop?

I have not run that. I have been using the Java submissions merykitty and merykitty_unsafe by Quan Anh Mai,

  • which per Gunnar run in 03.210s and 02.367s respectively on his benchmarking system,
  • and run in 15.149s and 14.917s on my M1 Macbook

Based on that, my implementation, running in 8.646s on the M1, is about 40% faster.

Done.

It’s here if you wanna try. The author also made an extended input and I believe is keeping a leaderboard that you could add yours to

What? No. Reading from disk is a blocking operation and will block the thread until done. To get around that you need to use either a thread pool with more threads, or io_uring. That is assuming Linux is used. I’m not aware of what possibilities there are for reading files asynchronously on macOS or Windows.

(If we want to deep dive into implementation specifics: the implementation uses wait_readable for disk IO as well, but files are always readable; it is the actual reading that blocks and takes time. This is a big difference from how network sockets work. It is also one of the main reasons it would be neat to have an io_uring based event machine.)

1 Like

Oh yeah. You’re right. File IO is indeed blocking (by default; most of the time). I was misled by the socket mindset.

A question

  • Since an IO read blocks its thread, that seems to explain why I’m seeing better performance by running so many fibers relative to the number of threads; is that a reasonable conclusion?

A discovery

  • I’ve been playing with mmap using a test app and was surprised to find it didn’t help on macOS. So I switched to a Linux machine and ran my test programs and … drum roll … mmap won hands down.
  • Essentially, using mmap is a win for Linux and not so much for macOS in the situation where I’m chunking one big file and scanning it concurrently.

I hope to have an mmap-based version of my Crystal 1brc app soon.

Do you have any good links for using mmap from Crystal? I’ve been curious for a while, but haven’t found anything good.

After some research (aka Duck Duck Fu) I discovered that Crystal’s stdlib already has definitions for the mmap API under LibC. So, using man mmap as a reference, I was able to get the following very simple program that opens a file and loads it via mmap:

file = File.new(ARGV.first, "r")
# mmap 1K of the file starting at offset 0
ptr = LibC.mmap(nil, 1024, LibC::PROT_READ, LibC::MAP_PRIVATE, file.fd, 0)
byteptr = Pointer(UInt8).new(ptr.address)             # cast to byte pointer
buf = Bytes.new(byteptr, 1024, read_only: true)       # wrap with a Slice(UInt8)

# do stuff
STDOUT.write_string(buf)

# clean up
LibC.munmap(ptr, 1024)     

This is just to illustrate how to use it. When you compile and run this with a file argument, it will

  • open the file
  • mmap the first 1K of the file to a pointer
  • cast and wrap it into a Bytes slice
  • print out the 1K
  • clean up the mapping

Also note:

  • I used PROT_READ, but if you want to use mmap to then modify the file you’ll need to OR in PROT_WRITE.
  • I used MAP_PRIVATE for exclusive use, but if you want to share the mapping etc., there are other MAP_* options in the man page that are also available (one variation is sketched below).
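
For instance, assuming your platform’s LibC bindings also expose PROT_WRITE and MAP_SHARED (they sit alongside MAP_PRIVATE on the common targets), a writable shared mapping of the same file would look like:

# Reusing `file` from the snippet above; with a shared mapping, writes
# through the pointer are carried back to the underlying file.
ptr = LibC.mmap(nil, 1024, LibC::PROT_READ | LibC::PROT_WRITE, LibC::MAP_SHARED, file.fd, 0)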
3 Likes

Posted latest update after adding an mmap-based implementation. Turns out mmap improves performance on Linux, but not so much on macOS. Since the challenge’s target was Linux, I’ll continue to test and post results for that OS.

As you can see from my comparisons, I’m still not doing as well as I would like. The merykitty_unsafe Java app running on OpenJDK 21 is beating my app hands down. To help figure out what was taking up time, I commented out the code where I use a Hash to store the stats and look them up by name.

Without Hash map use …

% time ./run.sh 1brc_parallel_mmap 64 8
./run.sh 1brc_parallel_mmap 64 8 17.58s user 0.66s system 1276% cpu 1.429 total

With Hash map use …

% time ./run.sh 1brc_parallel_mmap 64 8
./run.sh 1brc_parallel_mmap 64 8 45.62s user 0.76s system 1443% cpu 3.214 total

That’s ~2.24x the time when storing the data, which makes the Hash an obvious candidate for my next attempts.

3 Likes

Update, Feb 11

  • Switched to long-word bitmasking for parsing the name (a sketch of the scan follows this list).
  • A small improvement by itself: my best is now 1.74x slower, improved from the previous best of 1.84x slower.
  • Posting results because I’m starting to see parallelism match the available cores.
  • Btw, for me at least, there is no significant difference between my two best results; given the way the file is read in parallel and in chunks, using mmap or not doesn’t make a big difference.
    • This is notable only because it might mean the non-mmap version, which uses ~1GB of runtime RAM, is going to do well regardless of file size.
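
The "long word bitmasking" is essentially the classic SWAR zero-byte trick applied to the ';' delimiter; a rough sketch (names are illustrative and details may differ from what's in the repo):

SEMI_BYTES = 0x3b3b3b3b3b3b3b3b_u64 # ';' (0x3b) broadcast into every byte
LOW_BITS   = 0x0101010101010101_u64
HIGH_BITS  = 0x8080808080808080_u64

# Returns a non-zero mask iff the 8 bytes in `word` contain a ';'.
# For a little-endian load, the lowest set bit marks the first match,
# so the byte offset of the first ';' is trailing_zeros_count // 8.
def semicolon_matches(word : UInt64) : UInt64
  x = word ^ SEMI_BYTES            # bytes equal to ';' become zero
  (x &- LOW_BITS) & ~x & HIGH_BITS # standard "has a zero byte" bit trick
end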

I realise there are even faster C#/C/SIMD/etc versions out there. And I haven’t run the official GraalVM-based Java winner yet either. At this point serkan-ozal is my goal. :smile:

relative       command                     Lang     mean (s)            stddev (s)
1.0x           serkan-ozal                 Java     1.1661091954200002  0.10359078275160513
               dannyvankooten/analyze      C        1.24137285062       0.003261490844545827
               merykittyunsafe.sh          Java     1.34270993842       0.036888174169103186
               merykitty.sh                Java     1.51770504842       0.021277280568716347
1.74x slower   1brc_parallel_ptr4 16 48    Crystal  2.03081785402       0.06009663760371807
1.77x slower   1brc_parallel_mmap2b 16 32  Crystal  2.06915509142       0.01875921473395469
3 Likes