String views or some pointer into strings

So I have had a few ides that have fallen flat so take this with a gain of salt.

What if some string methods did not return strings but a struct with pointers into strings. Strings are immutable so there should not be case were an existing string would change and invalidate a view. Consider the code below.

struct StringView
  def initialize(
    @buffer : Pointer(UInt8),
    @start : Int32,
    @size : Int32
  )
  end

  def to_s(io : IO) : Nil
    io.write_string(Bytes.new(@size) { |i|
      (@buffer + i).value.to_u8
    })
  end

  def [](index) : Char
    @buffer[index].unsafe_chr
  end
end

Some String methods could use what I called string views which are pointers into strings. This allocates a few ints on the stack instead of allocating new strings on the stack. Here is an example of the mutations to the string class.


class String
  def byte_view(start : Int, count : Int) : StringView
    StringView.new(to_unsafe, start, count)
  end

  def chomp_view
    return StringView.new(to_unsafe, 0, 0) if empty?

    case to_unsafe[bytesize - 1]
    when '\n'
      if bytesize > 1 && to_unsafe[bytesize - 2] === '\r'
        StringView.new(to_unsafe, 0, bytesize - 2)
      else
        StringView.new(to_unsafe, 0, bytesize - 1)
      end
    when '\r'
      StringView.new(to_unsafe, 0, bytesize - 1)
    else
      StringView.new(to_unsafe, 0, bytesize)
    end
  end

  NEWLINE_U8  = '\n'.ord.to_u8
  RETURN_CHAR = '\r'

  def each_line_view(chomp = true, &block : StringView ->) : Nil
    # return if empty?

    offset = 0

    while byte_index = byte_index(NEWLINE_U8, offset)
      count = byte_index - offset + 1

      if str = chomp_view
        count -= 1
        count -= 1 if offset + count > 0 && str[offset + count - 1] === RETURN_CHAR
      end

      yield StringView.new(to_unsafe, offset, count)

      offset = byte_index + 1
    end

    StringView.new(to_unsafe, 0, 0) unless offset == bytesize
  end

  def strip_view
    excess_left = calc_excess_left
    if excess_left == bytesize
      return StringView.new(to_unsafe, 0, 0)
    end

    excess_right = calc_excess_right
    StringView.new(
      to_unsafe,
      excess_left,
      excess_left + excess_right
    )
  end
end

From some simple benchmarks I made from the sample code from crystal-lang.org it seems like there are performance advantages from 2X to 6X. What am I missing???

here is a link to some sample code.

I know I am missing something.

1 Like

This idea does come up every now and then. It’s a really clever way to negate a lot of intermediate string allocations, to be sure.

However, I believe the reason it hasn’t been implemented is that as long as any StringView based on it remains in memory, the original String cannot be GCed. In a lot of cases, it’s probably okay to hold onto that memory because it’ll probably only be a few extra milliseconds or seconds before all of it gets GCed anyway — such as in HTTP requests or background jobs that throw away all their allocated memory upon completion. But when it’s not okay it’s really bad. It can use up a whole lot of memory in ways that are nearly impossible to trace.

Let’s say you fetch some JSON from another service and parse various string properties into a StringViews to minimize RAM usage of the parsing operation. Then you cache one or more parts of that parsed object to avoid re-fetching the same data multiple times. In this scenario, you end up caching the entire response body, no matter how large it is. Depending on your cache TTL and the diversity of the inputs to the method that sends those requests, you may end up using a lot of extra memory, despite trying to use less. And that additional memory usage is upstream from the thing that creates the StringViews, making it incredibly difficult to track down the issue.

3 Likes

More discussion here:

If the idea is to reduce allocations for strings used multiple times, could look into StringPool - Crystal 1.10.1. Maybe not applicable here, but it is one of those types that is easy to miss.