Faster floating point parsing algorithm

jzakiya · September 9, 2021, 10:32pm

I saw that Rust 1.55 was released today (Thu Sep 9, 2021), and looking at the release notes I saw they're now using a faster and more accurate floating point number parsing algorithm.

I have no idea if this has any relevance to Crystal, but it seemed interesting and potentially useful. So here are some links to it.

github.com

rust-lang/rust/blob/master/RELEASES.md#version-55-2021-09-09

Version 1.55.0 (2021-09-09)
============================

Language
--------
- [You can now write open "from" range patterns (`X..`), which will start at `X` and
  will end at the maximum value of the integer.][83918]
- [You can now explicitly import the prelude of different editions
  through `std::prelude` (e.g. `use std::prelude::rust_2021::*;`).][86294]

Compiler
--------
- [Added tier 3\* support for `powerpc64le-unknown-freebsd`.][83572]

\* Refer to Rust's [platform support page][platform-support-doc] for more
   information on Rust's tiered platform support.

Libraries
---------


github.com/rust-lang/rust

Update Rust Float-Parsing Algorithms to use the Eisel-Lemire algorithm.

rust-lang:master ← Alexhuszagh:master

opened 09:44PM - 30 Jun 21 UTC

Alexhuszagh

+2530 -2823

# Summary

Rust, although it implements a correct float parser, has major performance issues in float parsing. Even for common floats, the performance can be 3-10x [slower](https://arxiv.org/pdf/2101.11408.pdf) than external libraries such as [lexical](https://github.com/Alexhuszagh/rust-lexical) and [fast-float-rust](https://github.com/aldanor/fast-float-rust). Recently, major advances in float-parsing algorithms have been developed by Daniel Lemire, along with others, and implement a fast, performant, and correct float parser, with speeds up to 1200 MiB/s on Apple's M1 architecture for the [canada](https://github.com/lemire/simple_fastfloat_benchmark/blob/0e2b5d163d4074cc0bde2acdaae78546d6e5c5f1/data/canada.txt) dataset, 10x faster than Rust's 130 MiB/s.

In addition, [edge-cases](https://github.com/rust-lang/rust/issues/85234) in Rust's [dec2flt](https://github.com/rust-lang/rust/tree/868c702d0c9a471a28fb55f0148eb1e3e8b1dcc5/library/core/src/num/dec2flt) algorithm can lead to over a 1600x slowdown relative to efficient algorithms. This is due to the use of Clinger's correct, but slow [AlgorithmM and Bellerophon](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.4152&rep=rep1&type=pdf), which have been improved by faster big-integer algorithms and the Eisel-Lemire algorithm, respectively.

Finally, this algorithm provides substantial improvements in the number of floats the Rust core library can parse. Denormal floats with a large number of digits cannot be parsed, due to use of the `Big32x40`, which simply does not have enough digits to round a float correctly. Using a custom decimal class, with much simpler logic, we can parse all valid decimal strings of any digit count.

```rust
// Issue in Rust's dec2flt.
"2.47032822920623272088284396434110686182e-324".parse::<f64>();
// Err(ParseFloatError { kind: Invalid })
```

# Solution

This pull request implements the Eisel-Lemire algorithm, modified from [fast-float-rust](https://github.com/aldanor/fast-float-rust) (which is licensed under Apache 2.0/MIT), along with numerous modifications to make it more amenable to inclusion in the Rust core library. The following describes both features in fast-float-rust and improvements in fast-float-rust for inclusion in core.

**Documentation**

Extensive documentation has been added to ensure the code base may be maintained by others, which explains the algorithms as well as various associated constants and routines. For example, two seemingly magical constants include documentation to describe how they were derived as follows:

```rust
// Round-to-even only happens for negative values of q
// when q ≥ −4 in the 64-bit case and when q ≥ −17 in
// the 32-bit case.
//
// When q ≥ 0, we have that 5^q ≤ 2m+1. In the 64-bit case, we
// have 5^q ≤ 2m+1 ≤ 2^54 or q ≤ 23. In the 32-bit case, we have
// 5^q ≤ 2m+1 ≤ 2^25 or q ≤ 10.
//
// When q < 0, we have w ≥ (2m+1)×5^−q. We must have that w < 2^64
// so (2m+1)×5^−q < 2^64. We have that 2m+1 > 2^53 (64-bit case)
// or 2m+1 > 2^24 (32-bit case). Hence, we must have 2^53×5^−q < 2^64
// (64-bit) and 2^24×5^−q < 2^64 (32-bit). Hence we have 5^−q < 2^11
// or q ≥ −4 (64-bit case) and 5^−q < 2^40 or q ≥ −17 (32-bit case).
//
// Thus we have that we only need to round ties to even when
// we have that q ∈ [−4,23] (in the 64-bit case) or q ∈ [−17,10]
// (in the 32-bit case). In both cases, the power of five (5^|q|)
// fits in a 64-bit word.
const MIN_EXPONENT_ROUND_TO_EVEN: i32;
const MAX_EXPONENT_ROUND_TO_EVEN: i32;
```

This ensures maintainability of the code base.

**Improvements for Disguised Fast-Path Cases**

The fast path in float parsing algorithms attempts to use native, machine floats to represent both the significant digits and the exponent, which is only possible if both can be exactly represented without rounding. In practice, this means that the significant digits must fit in 53 bits or fewer and the exponent must be in the range `[-22, 22]` (for an f64). This is similar to the existing dec2flt implementation.

However, disguised fast-path cases exist, where there are few significant digits and an exponent above the valid range, such as `1.23e25`. In this case, powers-of-10 may be shifted from the exponent to the significant digits, discussed at length in https://github.com/rust-lang/rust/issues/85198.

**Digit Parsing Improvements**

Typically, integers are parsed from a string one digit at a time, requiring unnecessary multiplications which can slow down parsing. An approach to parse 8 digits at a time using only 3 multiplications is described at length [here](https://johnnylee-sde.github.io/Fast-numeric-string-to-int/). This leads to significant performance improvements, and is implemented for both big and little-endian systems.

**Unsafe Changes**

Relative to fast-float-rust, this library makes less use of unsafe functionality and clearly documents it. This includes the refactoring and documentation of numerous unsafe methods undesirably marked as safe. The original code would look something like this, which is deceptively marked as safe for unsafe functionality.

```rust
impl AsciiStr {
    #[inline]
    pub fn step_by(&mut self, n: usize) -> &mut Self {
        unsafe { self.ptr = self.ptr.add(n) };
        self
    }
}
...
#[inline]
fn parse_scientific(s: &mut AsciiStr<'_>) -> i64 {
    // the first character is 'e'/'E' and scientific mode is enabled
    let start = *s;
    s.step();
    ...
}
```

The new code clearly documents safety concerns, and does not mark unsafe functionality as safe, leading to better safety guarantees.

```rust
impl AsciiStr {
    /// Advance the view by n, advancing it in-place to (n..).
    pub unsafe fn step_by(&mut self, n: usize) -> &mut Self {
        // SAFETY: same as step_by, safe as long n is less than the buffer length
        self.ptr = unsafe { self.ptr.add(n) };
        self
    }
}
...
/// Parse the scientific notation component of a float.
fn parse_scientific(s: &mut AsciiStr<'_>) -> i64 {
    let start = *s;
    // SAFETY: the first character is 'e'/'E' and scientific mode is enabled
    unsafe {
        s.step();
    }
    ...
}
```

This allows us to trivially demonstrate the new implementation of dec2flt is safe.

**Inline Annotations Have Been Removed**

In the previous implementation of dec2flt, inline annotations exist practically nowhere in the entire module. Therefore, these annotations have been removed, which mostly does not impact [performance](https://github.com/aldanor/fast-float-rust/issues/15#issuecomment-864485157).

**Fixed Correctness Tests**

Numerous compile errors in `src/etc/test-float-parse` were present, due to deprecation of `time.clock()`, as well as the crate dependencies with `rand`. The tests have therefore been reworked as a [crate](https://github.com/Alexhuszagh/rust/tree/master/src/etc/test-float-parse), and any errors in `runtests.py` have been patched.

**Undefined Behavior**

An implementation of `check_len` which relied on undefined behavior (in fast-float-rust) has been refactored, to ensure that the behavior is well-defined. The original code is as follows:

```rust
#[inline]
pub fn check_len(&self, n: usize) -> bool {
    unsafe { self.ptr.add(n) <= self.end }
}
```

And the new implementation is as follows:

```rust
/// Check if the slice at least `n` length.
fn check_len(&self, n: usize) -> bool {
    n <= self.as_ref().len()
}
```

Note that this has since been fixed in [fast-float-rust](https://github.com/aldanor/fast-float-rust/pull/29).

**Inferring Binary Exponents**

Rather than explicitly store binary exponents, this new implementation infers them from the decimal exponent, reducing the amount of static storage required. This removes the requirement to store [611 i16s](https://github.com/rust-lang/rust/blob/868c702d0c9a471a28fb55f0148eb1e3e8b1dcc5/library/core/src/num/dec2flt/table.rs#L8).

# Code Size

The code size, for all optimizations, does not considerably change relative to before for stripped builds, however it is **significantly** smaller prior to stripping the resulting binaries. These binary sizes were calculated on x86_64-unknown-linux-gnu.

**new**

Using rustc version 1.55.0-dev.

| opt-level | size | size (stripped) |
|:-:|:-:|:-:|
| 0 | 400k | 300K |
| 1 | 396k | 292K |
| 2 | 392k | 292K |
| 3 | 392k | 296K |
| s | 396k | 292K |
| z | 396k | 292K |

**old**

Using rustc version 1.53.0-nightly.

| opt-level | size | size (stripped) |
|:-:|:-:|:-:|
| 0 | 3.2M | 304K |
| 1 | 3.2M | 292K |
| 2 | 3.1M | 284K |
| 3 | 3.1M | 284K |
| s | 3.1M | 284K |
| z | 3.1M | 284K |

# Correctness

The dec2flt implementation passes all of Rust's unit tests and comprehensive float parsing tests, along with numerous other tests such as Nigel Tao's comprehensive float [tests](https://github.com/nigeltao/parse-number-fxx-test-data) and Hrvoje Abraham's [strtod_tests](https://github.com/ahrvoje/numerics/blob/master/strtod/strtod_tests.toml). Therefore, it is unlikely that this algorithm will incorrectly round parsed floats.

# Issues Addressed

This will fix and close the following issues:

- resolves #85198
- resolves #85214
- resolves #85234
- fixes #31407
- fixes #31109
- fixes #53015
- resolves #68396
- closes https://github.com/aldanor/fast-float-rust/issues/15
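
To make the "fast path" and "disguised fast path" described in the PR more concrete, here is a minimal sketch in Rust. It is not the code from the PR: the names `fast_path`, `POW10`, and `MAX_MANTISSA` are made up for illustration, and it assumes the decimal string has already been split into an integer mantissa and a power-of-ten exponent (so `1.23e25` arrives as mantissa `123`, exponent `23`). The cutoff of 15 extra powers of ten is also an assumption of this sketch.

```rust
/// Powers of ten that an f64 represents exactly (10^0 through 10^22).
const POW10: [f64; 23] = [
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22,
];

/// Integers up to 2^53 convert to f64 without rounding.
const MAX_MANTISSA: u64 = 1 << 53;

/// Try to compute `mantissa * 10^exp10` with a single, correctly rounded
/// machine-float operation. Returns `None` when a slower general
/// algorithm (for example Eisel-Lemire) would be needed instead.
fn fast_path(mantissa: u64, exp10: i32) -> Option<f64> {
    if mantissa > MAX_MANTISSA {
        return None;
    }
    // Regular fast path: both operands are exact, so one multiply or
    // divide yields the correctly rounded result.
    if (-22..=22).contains(&exp10) {
        let m = mantissa as f64;
        return Some(if exp10 < 0 {
            m / POW10[(-exp10) as usize]
        } else {
            m * POW10[exp10 as usize]
        });
    }
    // Disguised fast path: the exponent is too large, but the mantissa is
    // small, so the excess powers of ten are shifted into the mantissa.
    if exp10 > 22 && exp10 <= 22 + 15 {
        let shift = (exp10 - 22) as u32;
        let scaled = mantissa.checked_mul(10u64.pow(shift))?;
        if scaled > MAX_MANTISSA {
            return None;
        }
        return Some(scaled as f64 * POW10[22]);
    }
    None
}
```

For `1.23e25` the call would be `fast_path(123, 23)`: one power of ten moves into the mantissa, giving `1230 * 1e22`, and both factors are exact in an f64. Anything the function rejects with `None` would fall through to the general algorithm.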

pynixwang · September 13, 2021, 3:21pm

Crystal has more important things to work on than this optimization.

RespiteSage · September 13, 2021, 3:46pm

I agree, but if someone has interest in writing a PR for faster float parsing, why not?

jgaskins · September 14, 2021, 3:19am

@jzakiya Nice find! When writing my Redis client, the way I parsed integers was a surprisingly important part of keeping it fast:

➜  Code crystal run --release bench_parse_int.cr
String#to_i  24.83M ( 40.28ns) (± 1.58%)  32.0B/op   3.33× slower
byte parser  82.70M ( 12.09ns) (± 0.92%)   0.0B/op        fastest

Benchmark code

require "benchmark"

io = IO::Memory.new("12345678\n")
value = nil

Benchmark.ips do |x|
  x.report "String#to_i" do
    value = io.rewind.read_line.to_i
  end

  x.report "byte parser" do
    value = parse_int(io.rewind)
  end
end
pp value

def parse_int(io)
  int = 0i64
  negative = false
  loop do
    if peek = io.peek
      case next_byte = peek[0] 
      when nil
        break
      when '-'
        negative = true
        io.skip 1
      when '0'.ord..'9'.ord
        int = (int * 10) + (next_byte - '0'.ord)
        io.skip 1
      else
        break
      end
    else
      break
    end
  end

  if negative
    -int
  else
    int
  end
end

This doesn’t look like quite the same thing (if I understand it correctly, it looks like it’s changing the implementation of the existing parser for numeric strings already in memory), but if the performance benefits can be similarly impactful, this would be really nice for things like parsing JSON or even HTTP query/form params. Any text-based float parsing would benefit.

@pynixwang The OP didn’t deserve this response. Contributors can scratch their own itches. The core team doesn’t have to do it themselves. Besides, low-level optimizations can yield huge performance benefits in a hot loop.

rogerdpack · September 14, 2021, 5:45pm

Looks like Crystal uses LibC.strtod.
It would be interesting to see a benchmark. Or to fix LibC? LOL

If your integer parsing code is faster maybe you could PR that as well to crystal core? :)

asterite · September 14, 2021, 6:04pm

My guess is that Rust didn’t depend on LibC for parsing floats, they had their own routine. Then they optimized it, still in Rust. I doubt that will be faster than old libc’s strtod, but I don’t know.

asterite · September 14, 2021, 6:05pm

@jgaskins What happens if the IO is not peekable? I think that scenario isn’t handled in your snippet. Though I guess in real life the IO is always a socket, and it’s buffered, so maybe not something to worry about.

jgaskins · September 14, 2021, 8:00pm

Correct, and by choice. This was a purpose-built parser so I was able to make some assumptions to gain speed without sacrificing correctness. The Redis server is guaranteed to provide properly formatted numbers (notice I also didn’t check where the negative sign is or how many there are :-D) over a socket.

The idea was to point out that improving the performance of parsing numbers can improve overall performance. In the case of Redis, every single value in the Redis protocol contains a number, so optimizing that one thing was a big force multiplier.

I had the luxury of skipping formatting checks and the internal buffering required to check the next byte nondestructively in unbuffered I/O (and the CPU cycles that come with those things), but the stdlib unfortunately would not. It would have to account for plenty of edge cases I was able to ignore. :-)
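
As a rough illustration of the edge cases a general-purpose parser has to cover, here is a sketch in Rust (the language of the PR quoted at the top of the thread). It is not stdlib code from either language; the function name `parse_i64_checked` and its return shape (value plus bytes consumed) are assumptions made for this example. It accepts a sign only in the first position, rejects input with no digits, and reports overflow instead of wrapping.

```rust
/// Parse a decimal i64 from the front of `bytes`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input has no digits or the value does not fit in an i64.
fn parse_i64_checked(bytes: &[u8]) -> Option<(i64, usize)> {
    // A minus sign is only accepted as the very first byte.
    let (negative, mut pos) = match bytes.first() {
        Some(b'-') => (true, 1),
        _ => (false, 0),
    };
    let start = pos;
    let mut value: i64 = 0;
    while let Some(&b) = bytes.get(pos) {
        if !b.is_ascii_digit() {
            break;
        }
        let digit = i64::from(b - b'0');
        // Accumulate as a negative number so that i64::MIN is reachable;
        // the checked operations return None on overflow.
        value = value.checked_mul(10)?.checked_sub(digit)?;
        pos += 1;
    }
    if pos == start {
        return None; // no digits at all, e.g. "" or "-" or "--1"
    }
    let value = if negative { value } else { value.checked_neg()? };
    Some((value, pos))
}
```

Each of those checks costs a branch or two per byte, which is the overhead a purpose-built parser for trusted input gets to skip.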

FWIW, if it’s possible, a PR for parsing numbers from a text-based I/O without a single heap allocation would be amazing. I just don’t know if it’s feasible.

asterite · September 14, 2021, 8:17pm

@jgaskins Do you have the code for bench_parse_int.cr? I want to try something…

jgaskins · September 14, 2021, 8:18pm

It’s in a <details> element below the results.

asterite · September 14, 2021, 8:40pm

Thanks!

I tried this code and it seems slightly faster:

    def parse_int
      int = 0i64
      negative = false
      peek = @io.peek

      while peek && !peek.empty?
        peek.each_with_index do |byte, index|
          case byte
          when '-'
            negative = true
          when '0'.ord..'9'.ord
            int = (int * 10) + (byte - '0'.ord)
          else
            @io.skip(index)
            return negative ? -int : int
          end
        end
        @io.skip(peek.size)
        peek = @io.peek
      end

      negative ? -int : int
    end

The reason is that IO#peek will check if the IO is open on every call, and I think IO::Buffered does a few other checks.
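
For comparison, the same "fetch the buffer once, then walk the bytes" pattern can be sketched in Rust, assuming `std::io::BufRead` as a stand-in for Crystal's buffered IO: `fill_buf` plays the role of `IO#peek` and `consume` the role of `IO#skip`. This is an illustrative analogue, not a translation that has been benchmarked, and like the Crystal snippet it trusts the input (a `-` anywhere flips the sign, and overflow is not checked).

```rust
use std::io::{self, BufRead};

/// Read a decimal integer from a buffered reader, touching the reader
/// only once per buffer refill instead of once per byte.
fn parse_int<R: BufRead>(reader: &mut R) -> io::Result<i64> {
    let mut int: i64 = 0;
    let mut negative = false;
    loop {
        // One call hands back everything currently buffered.
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            break; // end of input
        }
        let mut used = 0;
        let mut done = false;
        for &byte in buf {
            match byte {
                b'-' => negative = true,
                b'0'..=b'9' => int = int * 10 + i64::from(byte - b'0'),
                _ => {
                    // First non-numeric byte: leave it in the buffer.
                    done = true;
                    break;
                }
            }
            used += 1;
        }
        // Advance past only the bytes that were actually parsed.
        reader.consume(used);
        if done {
            break;
        }
    }
    Ok(if negative { -int } else { int })
}
```

The reader is left positioned at the first byte that is not part of the number, so a caller can go on to read the line terminator or the next field.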

Then I also tried this one and it’s sliiiiiiiiightly faster, but maybe it’s too much or too verbose:

    def parse_int
      int = 0i64
      negative = false
      peek = @io.peek

      while peek && !peek.empty?
        peek.each_with_index do |byte, index|
          case byte
          when '-'.ord then negative = true
          when '0'.ord then int = int * 10
          when '1'.ord then int = int * 10 + 1
          when '2'.ord then int = int * 10 + 2
          when '3'.ord then int = int * 10 + 3
          when '4'.ord then int = int * 10 + 4
          when '5'.ord then int = int * 10 + 5
          when '6'.ord then int = int * 10 + 6
          when '7'.ord then int = int * 10 + 7
          when '8'.ord then int = int * 10 + 8
          when '9'.ord then int = int * 10 + 9
          else
            @io.skip(index)
            return negative ? -int : int
          end
        end
        @io.skip(peek.size)
        peek = @io.peek
      end

      negative ? -int : int
    end

Could you try these on your end to see what performance benefit you get?

Also, I think the algorithms are correct, but I’m not sure! Specs pass though…

asterite · September 14, 2021, 8:47pm

Actually, this seems faster and simpler, and it’s about twice as fast as the original algorithm:

    def parse_int
      int = 0i64
      negative = false
      peek = @io.peek

      if peek && !peek.empty? && peek[0] === '-'
        negative = true
        @io.skip(1)
        peek = @io.peek
      end

      while peek && !peek.empty?
        peek.each_with_index do |byte, index|
          if '0'.ord <= byte <= '9'.ord
            int = int * 10 + (byte - '0'.ord)
          else
            @io.skip(index)
            return negative ? -int : int
          end
        end
        @io.skip(peek.size)
        peek = @io.peek
      end

      negative ? -int : int
    end

asterite · September 14, 2021, 9:01pm

That said… no idea if this would impact a real benchmark of the entire Redis client. Maybe not allocating memory when parsing an int is enough, and these micro-optimizations won’t change that overall benchmark.

jgaskins · September 14, 2021, 9:13pm

This is really impressive! I love that you get even more nerd-sniped by stuff like this than I do. :-D It’s a very specific kind of inspiration and I appreciate it so much.

I’m traveling at the moment so I don’t have access to an Intel machine but even on the Apple M1 (where sys calls often have reduced impact) it shows significant gains:

➜  redis git:(master) ✗ crystal run --release bench/parse_int.cr
      String#to_i  25.17M ( 39.73ns) (± 0.79%)  32.0B/op   4.59× slower
      byte parser  82.46M ( 12.13ns) (± 0.50%)   0.0B/op   1.40× slower
ary's byte parser 115.48M (  8.66ns) (± 0.57%)   0.0B/op        fastest

Indeed, I’ve been meaning to put together some more end-to-end benchmarks. What I’ve mainly been using is the benchmark I wrote for this article, since it covers the most common Redis operations for apps I tend to work on: getting/setting session/cache data and incrementing counters.

asterite · September 14, 2021, 10:48pm

I just ran the code in the article with both versions: they ran at the same speed.

asterite · September 14, 2021, 10:59pm

Thank you! Optimizing things is the thing I enjoy most when programming.
