[RFC] String#squish

What do you think about adding Rails’ String#squish to the standard library?

It’s a pretty useful method to have in specs when you want to compare one string with another and you don’t care about the exact whitespace (spaces, newlines, etc.) in the output, only about the content (the non-whitespace part).

These would be the tests for this:

  describe "squish" do
    it { "   ".squish.should eq("") }
    it { "a".squish.should eq("a") }
    it { "abc".squish.should eq("abc") }
    it { " abc ".squish.should eq("abc") }
    it { " abc \n\r\t   def   ".squish.should eq("abc def") }
    it { " \n\t abc \n\r\t   def  \n  g  h  ".squish.should eq("abc def g h") }
  end

And this is an example implementation:

class String
  def squish : String
    reader = Char::Reader.new(self)

    # Skip initial whitespace
    while reader.current_char.whitespace?
      reader.next_char
    end

    # If we reached the end, we are done
    return "" unless reader.has_next?

    String.build(bytesize) do |io|
      loop do
        # Skip over non-whitespace
        from_pos = reader.pos
        while reader.has_next? && !reader.current_char.whitespace?
          reader.next_char
        end

        # Copy it to the final String
        io.write(to_slice[from_pos, reader.pos - from_pos])

        # Skip whitespace
        while reader.current_char.whitespace?
          reader.next_char
        end

        # If we reached the end, no need to append the trailing whitespace
        break unless reader.has_next?

        # Append a single whitespace
        io << ' '
      end
    end
  end
end

Looks handy, but IMO a bit too specific for the standard library. I wouldn’t mind if it were added anyway; it’s just something I would never expect, so I probably wouldn’t even check the standard library for it, even if it was provided.

I’d like more flexibility, for example joining sequences of newlines. This could be achieved easily by providing the matcher as block argument and replacement char as regular arg.
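A rough sketch of what that could look like, just to make the idea concrete. The method name `squish_by` and its signature are made up for illustration and are not an existing or proposed API; the block decides which characters count as "squishable", and the regular argument is the replacement character.

```crystal
class String
  # Hypothetical helper (name and signature are assumptions, not stdlib API).
  # Collapses every run of characters for which the block is truthy into a
  # single `replacement`, and drops such runs at the start and end.
  def squish_by(replacement : Char = ' ', & : Char -> _) : String
    String.build(bytesize) do |io|
      pending = false # a matched run has been seen but not yet emitted
      wrote = false   # at least one non-matching char has been written

      each_char do |char|
        if yield char
          pending = true
        else
          # Emit the replacement only between two pieces of content,
          # which is what removes leading and trailing runs.
          io << replacement if pending && wrote
          io << char
          pending = false
          wrote = true
        end
      end
    end
  end
end

# Same behaviour as the proposed String#squish:
p " abc \n\r\t   def   ".squish_by(&.whitespace?) # => "abc def"

# Joining sequences of newlines only:
p "\n\nfoo\n\n\nbar\n".squish_by('\n') { |c| c == '\n' } # => "foo\nbar"
```

This goes char by char instead of copying slices, so it is probably slower than the `Char::Reader` version above; it is only meant to show the shape of the API.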

I’m wondering about the similarity to #squeeze. Semantically, this would be equivalent to squeeze(&.whitespace?) with a custom fixed replacement character (' ').
Perhaps we could consider merging the implementation into that method? Then you could do squeeze(' ', &.whitespace?). It might still be available as a separate method #squish (#squeeze without arguments is already taken).


Yeah, I thought about that, but squish also removes leading and trailing whitespace, which squeeze doesn’t do.
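To illustrate the difference, here is a small comparison. It assumes a `String#squish` with the behaviour described in the tests at the top; a short regex-based stand-in is defined here only so the snippet runs on its own.

```crystal
# Stand-in for the proposed method (assumption: not part of the stdlib).
class String
  def squish : String
    gsub(/\s+/, " ").strip
  end
end

text = "  abc \n\t  def  "

# squeeze(' ') only collapses repeated spaces: the leading and trailing
# space are kept, and so are the newline and the tab.
squeezed = text.squeeze(' ')

# squish collapses every whitespace run to one space and trims both ends.
squished = text.squish

p squeezed # => " abc \n\t def "
p squished # => "abc def"
```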

1 Like

I like it, although I think it could be better with a few options.

  • Control what the single replacement whitespace character is (space, tab, newline).
  • Allow keeping line breaks, but removing blank lines.
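
Something like the following could cover both options. This is only a sketch: `squish_with` and `squish_lines` are hypothetical names invented here, and the regex-based approach is chosen for brevity, not speed.

```crystal
class String
  # Hypothetical: like squish, but the caller picks the replacement.
  # Stripping first means the replacement never ends up at either end.
  def squish_with(replacement : String = " ") : String
    strip.gsub(/\s+/, replacement)
  end

  # Hypothetical: squish each line on its own, drop lines that end up
  # empty, and keep the remaining line breaks.
  def squish_lines : String
    lines
      .map { |line| line.strip.gsub(/\s+/, " ") }
      .reject(&.empty?)
      .join("\n")
  end
end

p " a \n  b ".squish_with("\t")                  # => "a\tb"
p "  foo \n\n  bar    baz \n\n".squish_lines     # => "foo\nbar baz"
```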

I’m able to pass all test cases with the following snippet:

class String
  def squish
    gsub(/\s+/, " ").strip
  end
end

Great! Can you run a benchmark between the two alternatives to see which one is faster?

I popped them both into a quick benchmark, which checks both shorter and slightly longer strings, and ran it on macOS on both Intel and ARM CPUs:

Benchmark code
require "benchmark"

string = nil
short = " abc def "
long = <<-STRING 
  foo
  bar    baz      omg
  asdf lasjkdbflk flk l kl a sdfklhj kljh lk       laksdjf laksdfj laksd jflkas djflk df




  STRING

puts "Short strings"
Benchmark.ips do |x|
  s = short
  x.report "WintereDesert" { string = s.squish_wintere_desert }
  x.report "ary's" { string = s.squish_ary }
end

puts
puts "Long strings"
Benchmark.ips do |x|
  s = long
  x.report "WintereDesert" { string = s.squish_wintere_desert }
  x.report "ary's" { string = s.squish_ary }
end

# This won't do anything, but LLVM can't tell that so it won't optimize it out
pp string unless string

class String
  def squish_ary : String
    reader = Char::Reader.new(self)

    # Skip initial whitespace
    while reader.current_char.whitespace?
      reader.next_char
    end

    # If we reached the end, we are done
    return "" unless reader.has_next?

    String.build(bytesize) do |io|
      loop do
        # Skip over non-whitespace
        from_pos = reader.pos
        while reader.has_next? && !reader.current_char.whitespace?
          reader.next_char
        end

        # Copy it to the final String
        io.write(to_slice[from_pos, reader.pos - from_pos])

        # Skip whitespace
        while reader.current_char.whitespace?
          reader.next_char
        end

        # If we reached the end, no need to append the trailing whitespace
        break unless reader.has_next?

        # Append a single whitespace
        io << ' '
      end
    end
  end


  def squish_wintere_desert
    gsub(/\s+/, " ").strip
  end
end

Results

The simpler solution is definitely easier to read, but takes 2.6-5x as long, likely due to allocating 2-5x more heap memory.

Intel

Short strings
WintereDesert   2.75M (364.26ns) (± 0.67%)  272B/op   2.94× slower
        ary's   8.08M (123.69ns) (± 0.68%)  128B/op        fastest

Long strings
WintereDesert 495.95k (  2.02µs) (± 1.06%)  1.0kB/op   2.65× slower
        ary's   1.31M (761.10ns) (± 2.88%)   224B/op        fastest

M1/ARM

Short strings
WintereDesert   2.66M (376.07ns) (± 0.67%)  272B/op   4.75× slower
        ary's  12.63M ( 79.19ns) (± 2.92%)  128B/op        fastest

Long strings
WintereDesert 380.66k (  2.63µs) (± 0.93%)  1.0kB/op   5.04× slower
        ary's   1.92M (520.83ns) (± 0.82%)   224B/op        fastest