Fair enough. I can explain, but tl;dr:
As the JSON parser is somewhat slow for long strings, I’m replacing some long – and potentially duplicate – string values in the JSON data with much shorter references (i.e. “some_long_blob” becomes “stringpool:1234”). This is all done on Bytes objects, before even feeding the data to the JSON parser. The parser now has much less work to do (some tens of MB less data to chew through codepoint by codepoint instead of using memchr).
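Roughly, the simplification step looks like this. This is only a sketch, not my actual code: it works on a String with a regex and a plain counter instead of on raw Bytes with hashing, and the 40-character threshold is arbitrary – only the “stringpool:N” reference format matches what I described above.

require "json"

# Simplified sketch of the replacement step (the real thing operates on Bytes).
pool = {} of Int32 => String
counter = 0

raw = %({"blob":"#{"x" * 50}","n":1})

simplified = raw.gsub(/"([^"\\]{40,})"/) do |_str, match|
  counter += 1
  pool[counter] = match[1]        # remember the original value
  %("stringpool:#{counter}")      # much shorter reference for the parser
end

puts simplified   # => {"blob":"stringpool:1","n":1}
puts pool[1].size # => 50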
There are cases when I want the original strings back in the resulting JSON objects, so I need to store the strings somewhere.
StringPool looked to me like it could handle the task, but I’m not sure any more. My get_str implementation above is broken. Frankly, I don’t fully understand the code - particularly the undocumented private methods.
The concept works if I don’t try to misuse StringPool but instead hand-roll my own StringBytesHashPool:
require "digest/sha256"
class StringBytesHashPool
def initialize
@pool = {} of Bytes => (Bytes | String)
@digester = Digest::SHA256.new
end
delegate :size, :clear, to: @pool
#OPTIMIZE Really simplistic.
def store( input : (Bytes | String) ) : Bytes
@digester.reset
@digester << input
hash_bytes = @digester.final
if @pool.has_key?( hash_bytes ) # Hash key already in use.
if (input != @pool[ hash_bytes ])
raise RuntimeError.new( "Collision in #{self}." )
end
else
@pool[ hash_bytes ] = input
end
return hash_bytes
end
def get( hash : Bytes ) : String?
# Convert the entry to a String once on the first retrieval:
if @pool[ hash ]?.is_a?( Bytes )
@pool[ hash ] = String.new( @pool[ hash ].as( Bytes ))
end
return @pool[ hash ]?.as( String? )
end
end
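For completeness, a quick usage sketch of the round-trip (store during simplification, get when restoring):

pool = StringBytesHashPool.new

# Simplification: stash the long value, keep only its 32-byte digest.
key = pool.store( "some_long_blob" )
pool.store( "some_long_blob" )  # same content => same key, stored only once
pool.size                       # => 1

# Restoration: look the original value up again by its digest.
pool.get( key )                 # => "some_long_blob"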
The implementation doesn’t handle collisions nicely yet, but then again, collisions in 256 bits – even 128 bits – are really unlikely. (A type 4 UUID has 122 random bits.)
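(Back-of-the-envelope: by the birthday bound, storing n distinct strings gives a collision probability of roughly n^2 / 2^257, so even a billion strings land around 2^-197 ≈ 10^-59.)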
This gets me down from e.g. 0.20 s for parsing alone to 0.05 s for simplifying plus 0.01 s for parsing, i.e. from 0.20 s to 0.06 s.
Side note: This makes me wonder whether it makes sense to replace all string values in the JSON data this way before parsing. Obviously, it would make more sense to make the JSON parser work on bytes instead of codepoints in the first place, but I looked into the code and failed to find a good place to hook in.