I would like to add `@flags : UInt8` to `String` to memoize a few facts about the string: is it ASCII-only? Is it single-byte optimizable? Is it valid UTF-8?
This information could be used to speed up searching.
Unfortunately, I failed to do it. I found a couple of places in the compiler infrastructure (the LLVM typer and program.cr), but changing them was not enough to compile anything successfully (it complains about DWARF, or something close to that). It looks like there are more places to fix.
Could you please suggest where else I should look to make the remaining changes?
There is an open issue about this proposal:
opened 09:35AM - 22 Apr 22 UTC
Labels: kind:feature, performance, topic:stdlib:text, tough-cookie
I just stumbled across this article in the current issue of Ruby Weekly: [*Code … Ranges: A Deeper Look at Ruby Strings*](https://shopify.engineering/code-ranges-ruby-strings). It's an interesting read about an internal optimization Ruby interpreters use to characterize a string. Basically, it caches the results of `ascii_only?` and `valid_encoding?`, thus they run in *O(1)* instead of *O(N)* (on repeated use).
We could apply this optimization to Crystal's `String` class as well. It's even easier to implement because strings are immutable in Crystal (so we don't have to worry about invalidation) and there is only one valid encoding (UTF-8). It's also more efficient for string literals because the compiler can calculate code range values ahead of time.
For implementing this, we need 2 bits in the string header to represent the four states:
* ascii_only
* valid
* broken
* unknown
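
To make the four states concrete, here is a minimal sketch of the caching idea as a wrapper around `String`, not a change to the string header itself. The names `CodeRange` and `CachedString` are hypothetical and exist only for illustration; an actual implementation would keep the two bits inside `String`.

```crystal
# Hypothetical names: CodeRange and CachedString are illustrations only,
# not part of Crystal's standard library.
enum CodeRange : UInt8
  Unknown   = 0
  AsciiOnly = 1
  Valid     = 2
  Broken    = 3
end

class CachedString
  # Starts as Unknown and is filled in by the first query.
  getter code_range : CodeRange = CodeRange::Unknown

  def initialize(@value : String)
  end

  # Scans the bytes at most once, then answers from the cached state.
  # Strings are immutable, so the cached state never needs invalidation.
  private def scan : CodeRange
    return @code_range unless @code_range.unknown?
    @code_range =
      if @value.to_slice.all? { |byte| byte < 0x80_u8 }
        CodeRange::AsciiOnly
      elsif @value.valid_encoding?
        CodeRange::Valid
      else
        CodeRange::Broken
      end
  end

  def ascii_only? : Bool
    scan.ascii_only?
  end

  def valid_encoding? : Bool
    # After scan the state is never Unknown, so "not broken" means valid.
    !scan.broken?
  end
end

text = CachedString.new("héllo")
puts text.ascii_only?     # first call scans the bytes
puts text.valid_encoding? # second call reuses the cached state
puts text.code_range
```

The first query pays the *O(N)* scan; every later query is a comparison against the cached enum value.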
We could try to squeeze that into the existing header values, which would reduce the effectively available string size a bit. So it would probably be considered a breaking change.
But I don't think there would be any problems with increasing the size of the string header. It just needs to be synchronized between compiler and runtime (which can be signalled via a compile time flag).
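
As a rough sketch of the synchronization problem, the snippet below models a header with an optional extra byte behind a compile-time flag. Everything here is an assumption for illustration: `HeaderSketch` is a stand-in struct rather than the real `String` class, its field order is only assumed to resemble the actual header, and the flag name `string_flags_sketch` is made up.

```crystal
# Hypothetical stand-in for the string header; not the real String class.
struct HeaderSketch
  @type_id = 0_i32
  @bytesize = 0_i32
  @length = 0_i32
  {% if flag?(:string_flags_sketch) %}
    @flags = 0_u8 # extra byte holding the cached state
  {% end %}
  @c = 0_u8 # first byte of the character data
end

# Offset at which the character data starts. The compiler (when emitting
# string literals) and the runtime must agree on this value.
HEADER_SIZE_SKETCH = offsetof(HeaderSketch, @c)

puts HEADER_SIZE_SKETCH
```

Building with `-Dstring_flags_sketch` shifts the data offset by one byte, which is why every place that hardcodes the layout has to check the same flag.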
I may have a branch with a prototype implementation for this lying around. I’ll have to look that up.
Good day.
Did you find the branch?
I see: I missed build_string_constant, and didn’t think about the trick in initialize_header.
I will try to cherry-pick your commit and go further.
Yeah, having the same layout information repeated across three different locations makes it a bit hard to track.
I had previously investigated this for the pull request "Refactor String header layout reflection" (#13335 in crystal-lang/crystal, by straight-shoota), so I know where to look. Not sure what I did back then to find this.