Faster floating point parsing algorithm

I saw that Rust 1.55 was released today (Thursday, Sep 9, 2021), and looking at the release notes I saw they’re using a faster and more accurate floating-point number parsing algorithm.

I have no idea if this has any relevance to Crystal, but it seemed interesting and potentially useful. So here are some links to it.
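For context, parsers like the one Rust adopted keep full accuracy by splitting the work into a cheap fast path and a slow, fully accurate fallback. A hedged sketch of the classic fast path (this is the well-known exact-multiplication shortcut, not a transcription of Rust's actual implementation): if the decimal mantissa fits in 53 bits and the power of ten is itself exactly representable, a single multiplication or division is correctly rounded.

```rust
// 10^0 ..= 10^22 are all exactly representable as f64.
const POW10: [f64; 23] = [
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22,
];

// Fast path: when both operands are exact, one correctly rounded
// multiply/divide gives the correctly rounded result. Otherwise the
// parser must fall back to a slower, fully accurate algorithm.
fn fast_path(mantissa: u64, exp10: i32) -> Option<f64> {
    if mantissa >= (1u64 << 53) || exp10 < -22 || exp10 > 22 {
        return None; // not exactly representable: take the slow path
    }
    let m = mantissa as f64; // exact conversion: mantissa < 2^53
    Some(if exp10 >= 0 {
        m * POW10[exp10 as usize]
    } else {
        m / POW10[(-exp10) as usize]
    })
}

fn main() {
    assert_eq!(fast_path(12345, -2), Some(123.45));
    assert_eq!(fast_path(25, 1), Some(250.0));
    assert_eq!(fast_path(1 << 60, 0), None); // mantissa needs > 53 bits
    println!("ok");
}
```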

6 Likes

Crystal has more important things to work on than this optimization.

1 Like

I agree, but if someone has interest in writing a PR for faster float parsing, why not?

6 Likes

@jzakiya Nice find! When writing my Redis client, the way I parsed integers was a surprisingly important part of keeping it fast:

➜  Code crystal run --release bench_parse_int.cr
String#to_i  24.83M ( 40.28ns) (± 1.58%)  32.0B/op   3.33× slower
byte parser  82.70M ( 12.09ns) (± 0.92%)   0.0B/op        fastest
Benchmark code
require "benchmark"

io = IO::Memory.new("12345678\n")
value = nil

Benchmark.ips do |x|
  x.report "String#to_i" do
    value = io.rewind.read_line.to_i
  end

  x.report "byte parser" do
    value = parse_int(io.rewind)
  end
end
pp value

def parse_int(io)
  int = 0i64
  negative = false
  loop do
    if peek = io.peek
      case next_byte = peek[0]?
      when nil
        break
      when '-'
        negative = true
        io.skip 1
      when '0'.ord..'9'.ord
        int = (int * 10) + (next_byte - '0'.ord)
        io.skip 1
      else
        break
      end
    else
      break
    end
  end

  if negative
    -int
  else
    int
  end
end

This doesn’t look like quite the same thing (if I understand it correctly, it changes the implementation of the existing parser for numeric strings already in memory), but if the performance benefits are similarly impactful, this would be really nice for things like parsing JSON or even HTTP query/form params. Any text-based float parsing would benefit.

@pynixwang The OP didn’t deserve this response. Contributors can scratch their own itches. The core team doesn’t have to do it themselves. Besides, low-level optimizations can yield huge performance benefits in a hot loop.

7 Likes

Looks like Crystal uses LibC.strtod
It would be interesting to see a benchmark. Or to fix LibC? LOL

If your integer parsing code is faster maybe you could PR that as well to crystal core? :)

2 Likes

My guess is that Rust didn’t depend on libc for parsing floats; they had their own routine, and then they optimized it, still in Rust. I doubt the old one was faster than libc’s strtod, but I don’t know.
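For what it’s worth, that guess appears to be right: Rust’s float parsing (f64::from_str, in the internal core::num::dec2flt module) is pure Rust with no call into libc’s strtod, which is what made swapping in a new algorithm possible. A trivial check of the standard parser:

```rust
fn main() {
    // str::parse::<f64> goes through Rust's own dec2flt parser,
    // not libc's strtod.
    let x: f64 = "123.45e-1".parse().unwrap();
    assert_eq!(x, 12.345);
    println!("{x}");
}
```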

2 Likes

@jgaskins What happens if the IO is not peekable? I think that scenario isn’t handled in your snippet. Though I guess in real life the IO is always a socket, and it’s buffered, so maybe not something to worry about.

1 Like

Correct, and by choice. This was a purpose-built parser so I was able to make some assumptions to gain speed without sacrificing correctness. The Redis server is guaranteed to provide properly formatted numbers (notice I also didn’t check where the negative sign is or how many there are :-D) over a socket.

The idea was to point out that improving the performance of parsing numbers can improve overall performance. In the case of Redis, every single value in the Redis protocol contains a number, so optimizing that one thing was a big force multiplier.

I had the luxury of skipping formatting checks and the internal buffering required to check the next byte nondestructively in unbuffered I/O (and the CPU cycles that come with those things), but the stdlib unfortunately would not. It would have to account for plenty of edge cases I was able to ignore. :-)

FWIW, if it’s possible, a PR for parsing numbers from a text-based I/O without a single heap allocation would be amazing. I just don’t know if it’s feasible.

1 Like

@jgaskins Do you have the code for bench_parse_int.cr? I want to try something…

1 Like

It’s in a <details> element below the results.

1 Like

Thanks!

I tried this code and it seems slightly faster:

    def parse_int
      int = 0i64
      negative = false
      peek = @io.peek

      while peek && !peek.empty?
        peek.each_with_index do |byte, index|
          case byte
          when '-'
            negative = true
          when '0'.ord..'9'.ord
            int = (int * 10) + (byte - '0'.ord)
          else
            @io.skip(index)
            return negative ? -int : int
          end
        end
        @io.skip(peek.size)
        peek = @io.peek
      end

      negative ? -int : int
    end

The reason is that IO#peek will check if the IO is open on every call, and I think IO::Buffered does a few other checks.
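The same batched-buffer idea translates to other languages too. As a hedged sketch (not the actual Redis client code), here it is in Rust, where BufRead::fill_buf/consume play the role of IO#peek/IO#skip, so the per-call buffer bookkeeping happens once per refill instead of once per byte:

```rust
use std::io::{BufRead, Cursor};

// Scan digits straight out of the read buffer; only touch the reader
// when a buffer is exhausted or the number ends.
fn parse_int<R: BufRead>(io: &mut R) -> i64 {
    let mut int: i64 = 0;
    let mut negative = false;
    loop {
        // Copy the peeked bytes so we can call `consume` while scanning.
        let buf = match io.fill_buf() {
            Ok(b) if !b.is_empty() => b.to_vec(),
            _ => break, // EOF or error: return what we have
        };
        for (index, &byte) in buf.iter().enumerate() {
            match byte {
                b'-' => negative = true,
                b'0'..=b'9' => int = int * 10 + i64::from(byte - b'0'),
                _ => {
                    io.consume(index); // leave the terminator unread
                    return if negative { -int } else { int };
                }
            }
        }
        io.consume(buf.len());
    }
    if negative { -int } else { int }
}

fn main() {
    let mut io = Cursor::new("12345678\n");
    assert_eq!(parse_int(&mut io), 12_345_678);
    let mut io = Cursor::new("-42,rest");
    assert_eq!(parse_int(&mut io), -42);
    println!("ok");
}
```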

Then I also tried this once and it’s sliiiiiiiiightly faster, but maybe it’s too much or too verbose:

    def parse_int
      int = 0i64
      negative = false
      peek = @io.peek

      while peek && !peek.empty?
        peek.each_with_index do |byte, index|
          case byte
          when '-'.ord then negative = true
          when '0'.ord then int = int * 10
          when '1'.ord then int = int * 10 + 1
          when '2'.ord then int = int * 10 + 2
          when '3'.ord then int = int * 10 + 3
          when '4'.ord then int = int * 10 + 4
          when '5'.ord then int = int * 10 + 5
          when '6'.ord then int = int * 10 + 6
          when '7'.ord then int = int * 10 + 7
          when '8'.ord then int = int * 10 + 8
          when '9'.ord then int = int * 10 + 9
          else
            @io.skip(index)
            return negative ? -int : int
          end
        end
        @io.skip(peek.size)
        peek = @io.peek
      end

      negative ? -int : int
    end

Could you try these on your end to see what performance benefit you get?

Also, I think the algorithms are correct, but I’m not sure! Specs pass though…

1 Like

Actually, this seems faster and simpler, and it’s about twice as fast as the original algorithm :smiley:

    def parse_int
      int = 0i64
      negative = false
      peek = @io.peek

      if peek && !peek.empty? && peek[0] === '-'
        negative = true
        @io.skip(1)
        peek = @io.peek
      end

      while peek && !peek.empty?
        peek.each_with_index do |byte, index|
          if '0'.ord <= byte <= '9'.ord
            int = int * 10 + (byte - '0'.ord)
          else
            @io.skip(index)
            return negative ? -int : int
          end
        end
        @io.skip(peek.size)
        peek = @io.peek
      end

      negative ? -int : int
    end

8 Likes

That said… no idea if this would impact a real benchmark of the entire Redis client. Maybe not allocating memory when parsing an Int is enough, and these micro-optimizations won’t change that overall benchmark.

1 Like

This is really impressive! I love that you get even more nerd-sniped by stuff like this than I do. :-D It’s a very specific kind of inspiration and I appreciate it so much.

I’m traveling at the moment so I don’t have access to an Intel machine, but even on the Apple M1 (where syscalls often have reduced impact) it shows significant gains:

➜  redis git:(master) ✗ crystal run --release bench/parse_int.cr
      String#to_i  25.17M ( 39.73ns) (± 0.79%)  32.0B/op   4.59× slower
      byte parser  82.46M ( 12.13ns) (± 0.50%)   0.0B/op   1.40× slower
ary's byte parser 115.48M (  8.66ns) (± 0.57%)   0.0B/op        fastest

Indeed, I’ve been meaning to put together some more end-to-end benchmarks. What I’ve mainly been using was the benchmark I wrote for this article since they’re the most common Redis operations for apps I tend to work on: getting/setting session/cache data and incrementing counters.

2 Likes

I just ran the code in the article with both versions: they ran at the same speed :smiley:

3 Likes

Thank you! Optimizing things is what I enjoy most when programming.

12 Likes