StringPool: make the hash key part of the public API

What do you think about making the hash key used in StringPool part of the public API?

Currently, these methods are private:

  • private def hash(str, len)
  • private def get(hash : UInt64, str : UInt8*, len)

I’d like to store a String in the pool and get the hash key.
And then, of course, I need a way to retrieve the String by a given hash key. Something like this:

require "string_pool"

class StringPool
  # Store a `String` in the pool and return the hash key (an `UInt64`).
  def get_hash( str : String ) : UInt64
    return get_hash( str.to_unsafe, str.bytesize )
  end

  # :ditto:
  def get_hash( str_ptr : UInt8*, len : Int32 ) : UInt64
    hash : UInt64 = hash( str_ptr, len )
    get( hash, str_ptr, len )
    return hash
  end

  # Get a `String` out of the pool by a given hash key (an `UInt64`).
  def get_str( hash : UInt64 ) : String?
    mask = (@capacity - 1).to_u64
    index = hash & mask
    return nil  if @hashes[ index ].zero?
    str = @values[ index ]
    return str
  end
end


Could you explain your use case for this?

Fair enough. I can, but it’s a longer story, so here’s the tl;dr.

As the JSON parser is somewhat slow for long strings, I’m replacing some long – and potentially duplicate – string values in the JSON data with much shorter references (e.g. “some_long_blob” becomes “stringpool:1234”).

This is all done on Bytes objects, before even feeding the data to the JSON parser. The parser now has much less work to do (some tens of MB less of data to chew through codepoint-by-codepoint instead of using memchr).

There are cases when I want the original strings back in the resulting JSON objects, so I need to store the strings somewhere.

StringPool looked to me like it could handle the task, but I’m not sure any more. My get_str implementation above is broken. Frankly, I don’t fully understand the code - particularly the undocumented private methods.

The concept works if I don’t try to misuse StringPool but instead hand-roll my own StringBytesHashPool:

require "digest/sha256"

class StringBytesHashPool
  def initialize
    @pool = {} of Bytes => (Bytes | String)
    @digester = Digest::SHA256.new
  end

  delegate :size, :clear, to: @pool

  # OPTIMIZE: Really simplistic.
  def store( input : (Bytes | String) ) : Bytes
    @digester << input
    hash_bytes = @digester.final
    @digester.reset
    if @pool.has_key?( hash_bytes )  # Hash key already in use.
      if input != @pool[ hash_bytes ]
        raise "Collision in #{self}."
      end
    end
    @pool[ hash_bytes ] = input
    return hash_bytes
  end

  def get( hash : Bytes ) : String?
    # Convert the entry to a String once on the first retrieval:
    if @pool[ hash ]?.is_a?( Bytes )
      @pool[ hash ] = String.new( @pool[ hash ].as( Bytes ) )
    end
    return @pool[ hash ]?.as( String? )
  end
end

The implementation doesn’t handle collisions nicely yet, but then again, collisions in 256 bits – even 128 bits – are really unlikely. (A type 4 UUID has 122 random bits.)
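To put a rough number on that claim, the usual birthday-bound approximation for the probability of any collision among $n$ random $b$-bit hashes is:

```latex
p_\text{collision} \approx \frac{n^2}{2^{b+1}}
```

For example, even with $n = 10^9$ stored strings and only $b = 128$ bits, that is roughly $10^{18} / 2^{129} \approx 1.5 \times 10^{-21}$.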

This gets me down from e.g. 0.20 s for parsing to 0.05 s for simplifying + 0.01 s for parsing, i.e. from 0.20 s to 0.06 s.

Side note: Makes me wonder if it makes sense to replace all string values in the JSON data before parsing in this way. Obviously, it would make more sense to make the JSON parser work on bytes instead of codepoints in the first place, but I looked into the code and failed to find a good place to hook into.

Hm, I don’t think I understand the concept entirely, so I’m not sure it even makes sense to do this (for what benefit? You still need to parse the JSON to figure out where the strings are, right?). It certainly looks more complicated to me than Sharing `String`s between deserializations of a `JSON::Serializable` type · Issue #13638 · crystal-lang/crystal · GitHub, which goes in a similar direction.
As you said StringPool might not be the right tool for what you want to do. It’s meant to deduplicate if you have many instances of the same string value, not shorten references to strings.

However, have you seen Optimize JSON parsing a bit by asterite · Pull Request #14366 · crystal-lang/crystal · GitHub ?
The JSON parsing algorithm can be optimized quite a bit without any fancy string caching.

Well, in the JSON data Bytes, I’m basically looking for occurrences of %|,"image":"|.to_slice and then the end of the string '"'.ord.to_u8 (using memchr under the hood). Not sure if that qualifies as parsing; I’d call it messing around with the JSON in its serialized form. It’s quite a bit faster than how the JSON lexer consumes strings.
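For illustration, here’s a minimal sketch of that byte-level rewriting step. The `"image"` key, the `find` helper and the key derivation are just assumptions for the example; a plain `Hash` stands in for the pool:

```crystal
# Sketch: swap long "image" values in raw JSON bytes for short references
# before parsing. `find` is a naive stand-in for a memchr-based search.
def find(haystack : Bytes, needle : Bytes, from : Int32) : Int32?
  i = from
  while i <= haystack.size - needle.size
    return i if haystack[i, needle.size] == needle
    i += 1
  end
  nil
end

pool = {} of String => Bytes  # stand-in for the string pool
marker = %|"image":"|.to_slice
quote = "\"".to_slice
json = %|{"id":1,"image":"AAAAAAAAAAAA","ok":true}|.to_slice

out = IO::Memory.new
pos = 0
while start = find(json, marker, pos)
  val_start = start + marker.size
  val_end = find(json, quote, val_start).not_nil!
  out.write json[pos, val_start - pos]  # copy everything up to the value
  key = (pool.size + 1).to_s            # hypothetical short key
  pool[key] = json[val_start, val_end - val_start]
  out << "stringpool:" << key           # write the short reference instead
  pos = val_end
end
out.write json[pos, json.size - pos]    # copy the rest
out.to_s  # => %|{"id":1,"image":"stringpool:1","ok":true}|
```

The parser then only ever sees the short `stringpool:…` references, and the original values can be swapped back in afterwards from the pool.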

Why am I doing it? For the fun and for the learning experience, I guess.

Anyways, I didn’t intend to mention JSON parsing in this thread. :-)

I see now. You want to optimise for a specific JSON document structure.

I think a couple of generic improvements to JSON parsing like I mentioned would probably reduce the effect of your optimization. When the JSON parser doesn’t have to decode UTF-8, it shouldn’t be slowed down as much by large strings.

I have seen that pull request, but IIRC I tried and failed to use those changes via monkey-patching.

Absolutely. My hackish optimization shows how much room for optimization there is in the JSON lexer/parser. Once the JSON module gets to that point, my optimization should not provide any performance gain.

For anyone looking for my solution (forum search, Google, …), here’s what I’m currently using as kind of a key-value store. Work in progress.

StringBytesHashPool (pretty bad name)
require "digest/crc32"

# A storage pool for `Bytes` and `String` objects.
class StringBytesHashPool
  def initialize
    @pool = {} of UInt64 => (Bytes | String)
    # Currently uses CRC32, as that's provided by Crystal's StdLib and it's
    # fast. Might as well use xxHash, MurmurHash, FarmHash or CityHash.
    @digester = Digest::CRC32.new
  end

  delegate :size, :clear, to: @pool

  # Store *input* (`Bytes` or `String`) and return the hash key as a `UInt64`.
  def store( input : (Bytes | String) ) : UInt64
    @digester << input
    hash_bytes = @digester.final
    @digester.reset
    crc_int = IO::ByteFormat::LittleEndian.decode( UInt32, hash_bytes )
    base_key : UInt64 = crc_int.to_u64 << 32
    key = base_key
    end_key = base_key + UInt32::MAX
    loop do
      unless @pool.has_key?( key )
        @pool[ key ] = input
        return key
      end
      return key  if input == @pool[ key ]
      # Collision. Try the next key:
      if key == end_key
        raise "Storage exhausted in #{self}."
      end
      key += 1
    end
  end

  # Store *input* (`Bytes` or `String`) and return the hash key as a
  # Base62-encoded `String`.
  # The digits used for the Base62 encoding are:
  # "`0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz`".
  # The hash key is guaranteed to be <= 11 chars and in many cases will be
  # <= 6 chars.
  def store_base62( input : (Bytes | String) ) : String
    hash = store( input )
    # If nothing is in the lower bits (0x……………………_00000000_u64), then just
    # use the high bits. Results in max. "4GFfc3" (6 chars)
    hash >>= 32  if (hash & 0x00000000_ffffffff_u64).zero?
    return hash.to_s( 62 )  # max. is UInt64::MAX.to_s(62) == "lYGhA16ahyf" (11 chars)
  end

  # Retrieve an entry from the pool as a `String` by a hash key.
  # Converts the entry to a String once on the first retrieval.
  # Returns `nil` if the key isn't found.
  def get?( key : UInt64 ) : String?
    if @pool[ key ]?.is_a?( Bytes )
      @pool[ key ] = String.new( @pool[ key ].as( Bytes ) )
    end
    return @pool[ key ]?.as( String? )
  end

  # :ditto:
  def get_base62?( hash_base62 : String ) : String?
    hash = hash_base62.to_u64( 62 )
    # If hash <= UInt32::MAX, then restore the optimization, i.e. restore
    # the low bits as high bits.
    hash <<= 32  if (hash & 0xffffffff_00000000_u64).zero?
    return get?( hash )
  end
end