String views or some pointer into strings

So I have had a few ides that have fallen flat so take this with a gain of salt.

What if some string methods did not return strings but a struct with pointers into strings. Strings are immutable so there should not be case were an existing string would change and invalidate a view. Consider the code below.

struct StringView
  def initialize(
    @buffer : Pointer(UInt8),
    @start : Int32,
    @size : Int32

  def to_s(io : IO) : Nil
    io.write_string( { |i|
      (@buffer + i).value.to_u8

  def [](index) : Char

Some String methods could use what I called string views which are pointers into strings. This allocates a few ints on the stack instead of allocating new strings on the stack. Here is an example of the mutations to the string class.

class String
  def byte_view(start : Int, count : Int) : StringView, start, count)

  def chomp_view
    return, 0, 0) if empty?

    case to_unsafe[bytesize - 1]
    when '\n'
      if bytesize > 1 && to_unsafe[bytesize - 2] === '\r', 0, bytesize - 2)
      else, 0, bytesize - 1)
    when '\r', 0, bytesize - 1)
    else, 0, bytesize)

  NEWLINE_U8  = '\n'.ord.to_u8
  RETURN_CHAR = '\r'

  def each_line_view(chomp = true, &block : StringView ->) : Nil
    # return if empty?

    offset = 0

    while byte_index = byte_index(NEWLINE_U8, offset)
      count = byte_index - offset + 1

      if str = chomp_view
        count -= 1
        count -= 1 if offset + count > 0 && str[offset + count - 1] === RETURN_CHAR

      yield, offset, count)

      offset = byte_index + 1
    end, 0, 0) unless offset == bytesize

  def strip_view
    excess_left = calc_excess_left
    if excess_left == bytesize
      return, 0, 0)

    excess_right = calc_excess_right
      excess_left + excess_right

From some simple benchmarks I made from the sample code from it seems like there are performance advantages from 2X to 6X. What am I missing???

here is a link to some sample code.

I know I am missing something.

1 Like

This idea does come up every now and then. It’s a really clever way to negate a lot of intermediate string allocations, to be sure.

However, I believe the reason it hasn’t been implemented is that as long as any StringView based on it remains in memory, the original String cannot be GCed. In a lot of cases, it’s probably okay to hold onto that memory because it’ll probably only be a few extra milliseconds or seconds before all of it gets GCed anyway — such as in HTTP requests or background jobs that throw away all their allocated memory upon completion. But when it’s not okay it’s really bad. It can use up a whole lot of memory in ways that are nearly impossible to trace.

Let’s say you fetch some JSON from another service and parse various string properties into a StringViews to minimize RAM usage of the parsing operation. Then you cache one or more parts of that parsed object to avoid re-fetching the same data multiple times. In this scenario, you end up caching the entire response body, no matter how large it is. Depending on your cache TTL and the diversity of the inputs to the method that sends those requests, you may end up using a lot of extra memory, despite trying to use less. And that additional memory usage is upstream from the thing that creates the StringViews, making it incredibly difficult to track down the issue.


More discussion here:

If the idea is to reduce allocations for strings used multiple times, could look into StringPool - Crystal 1.10.1. Maybe not applicable here, but it is one of those types that is easy to miss.