Interpolate regex at compile time

Hello,

I would like to know if there is a way to interpolate a regex at compile time.

I would like to do this:

macro interpolate_regex(regex)
  /\A#{ {{regex}} }/
end
interpolate_regex(/foo/)          # => /\A(?-imsx:foo)/
interpolate_regex(/foo/ix) # => /\A(?ix-ms:foo)/
interpolate_regex(/\n\/foo/) # => /\A(?-imsx:\n\/foo)/

But with interpolation happening at compile time.

I tried this:

macro interpolate_regex(regex)
  /\A{{regex.source.id}}/
end

But this doesn’t work if the regex has options or escaped characters.

interpolate_regex(/foo/) # => /\Afoo/
interpolate_regex(/foo/ix) # => /\Afoo/
interpolate_regex(/\n\/foo/) # => unknown regex option: f (Expanded /\A\n/foo/ ) 

The best I could come up with is this:

macro interpolate_regex(regex)
  {%
    str = "(?"
    str += 'i' if regex.options.includes?(:i)
    str += "ms" if regex.options.includes?(:m)
    str += 'x' if regex.options.includes?(:x)
    str += '-'
    str += 'i' unless regex.options.includes?(:i)
    str += "ms" unless regex.options.includes?(:m)
    str += 'x' unless regex.options.includes?(:x)
    str += ':'
    str += regex.source
    str += ')'
  %}
  /\A{{str.id}}/
end

But this still doesn’t work in the case of /\n\/foo/ (the / doesn’t get escaped). I wonder if there is a shorter/cleaner way?

I think it would be nice to have something like RegexLiteral#to_s that gives a regex ready to interpolate.

FWIW, it might be worth doing some benchmarks to see if this would even be beneficial. PCRE does pattern compilation/caching itself, so the majority of the performance boost might come from that. Plus, LLVM might just be smart enough to optimize it to a more performant regex even before it gets to PCRE.

Yeah, I wouldn’t expect any relevant performance gains from macro expansion. But there might be other reasons than optimization.

RegexLiteral#source returns the raw string describing the regular expression. It is not aware of the syntax for expressing regular expressions in Crystal (delimited by forward slashes). If you want to manually embed it in such a literal, you need to escape the delimiter (a simple .gsub(/\//, "\\/") should probably do).
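A minimal sketch of how that escaping could be combined with the option-prefix macro from the first post. It assumes, as the failed expansion above suggests, that RegexLiteral#source returns the slash unescaped; the macro name is simply reused from the question.

```crystal
macro interpolate_regex(regex)
  {%
    # Rebuild the inline option group, e.g. "(?i-msx:" for /.../i.
    str = "(?"
    str += "i" if regex.options.includes?(:i)
    str += "ms" if regex.options.includes?(:m)
    str += "x" if regex.options.includes?(:x)
    str += "-"
    str += "i" unless regex.options.includes?(:i)
    str += "ms" unless regex.options.includes?(:m)
    str += "x" unless regex.options.includes?(:x)
    str += ":"
    # Re-escape every "/" so the source can sit between new / delimiters.
    str += regex.source.gsub(/\//, "\\/")
    str += ")"
  %}
  /\A{{str.id}}/
end

# The escaped slash no longer ends the literal early.
p interpolate_regex(/\n\/foo/) =~ "\n/foo"
# Options from the argument are carried into the inline group.
p interpolate_regex(/foo/i) =~ "FOO bar"
```

Both calls should report a match at index 0 if the assumptions hold.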

I suppose the macro language could support interpolation in regex literals as well. Then you could implement it like this:

macro interpolate_regex(regex)
  {{ /\A#{ regex }/ }}
end

This could be a feature request. String interpolation already works like that in the macro language.

Thanks for the answers, the .gsub(/\//, "\\/") is really helpful!

The performance difference is huge if you only compare doing the interpolation with doing nothing. However, the difference is still non-negligible if you include a matching phase afterwards.

Anyway, doing the interpolation and calling Regex#to_s adds an extra cost, which may or may not be negligible depending on what else you do.

benchmark:

require "benchmark"

macro runtime(regex)
  /foo#{ {{regex}} }/
end

macro compiletime(regex)
  /foo(?-imsx:{{regex.source.id}})/
end

p runtime(/bar/)     # => /foo(?-imsx:bar)/
p compiletime(/bar/) # => /foo(?-imsx:bar)/

Benchmark.ips do |x|
  x.report("runtime interpolation") { runtime(/bar/) }
  x.report("compiletime interpolation") { compiletime(/bar/) }
end
# The difference is huge because this compares almost nothing with something.

short_text = "foobar"
Benchmark.ips do |x|
  x.report("short runtime interpolation + text matching") { runtime(/bar/) =~ short_text }
  x.report("short compiletime interpolation + text matching") { compiletime(/bar/) =~ short_text }
end

long_text = "foobaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaar"
Benchmark.ips do |x|
  x.report("long runtime interpolation + text matching") { runtime(/baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaar/) =~ long_text }
  x.report("long compiletime interpolation + text matching") { compiletime(/baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaar/) =~ long_text }
end

results:

    runtime interpolation 120.55k (  8.30µs) (±24.10%)  9.13kB/op  2384.45× slower
compiletime interpolation 287.45M (  3.48ns) (±19.54%)    0.0B/op          fastest
    short runtime interpolation + text matching  99.00k ( 10.10µs) (±22.38%)  9.14kB/op  110.90× slower
short compiletime interpolation + text matching  10.98M ( 91.08ns) (±20.92%)   16.0B/op         fastest
    long runtime interpolation + text matching  57.96k ( 17.25µs) (±20.53%)  10.0kB/op  116.47× slower
long compiletime interpolation + text matching   6.75M (148.12ns) (±20.68%)   16.0B/op         fastest

My original use case (probably for my next shard! :) ) is to provide a parser API in which users can use regexes. For a simple JSON parser, going from /\A{{regex.source.id}}/ to /\A#{ regex }/ makes performance 60 times worse.

I don’t know if I’m the only one with this use case. Anyway, with the gsub it becomes possible, and that’s totally fine with me. Crystal is definitely awesome!

Could you cache the interpolated regex? It only works faster if you write /foo/ because the compiler actually caches that regex in a hidden constant. So if you assign that regex somewhere and reuse that, the performance difference should be negligible.
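A minimal sketch of that manual caching, assuming the interpolated regex can live in a constant. TOKEN, ANCHORED_TOKEN and token_at_start? are hypothetical names used only for illustration.

```crystal
# Hypothetical user-supplied pattern.
TOKEN = /bar/i

# The runtime interpolation (Regex#to_s plus compiling the new pattern)
# runs once for this constant instead of on every call.
ANCHORED_TOKEN = /\A#{TOKEN}/

def token_at_start?(text : String) : Bool
  # Reuses the already-built regex; nothing is interpolated here.
  !(ANCHORED_TOKEN =~ text).nil?
end

puts token_at_start?("BAR baz") # => true
puts token_at_start?("foo bar") # => false
```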

Oh, that’s interesting, thanks!

Yes, indeed: once cached, there is no performance difference at all.
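One way this could be checked, sketched under the assumption that the cached regex is stored in a constant (BAR and CACHED are hypothetical names; compiletime is the macro from the benchmark above). No timings are shown here because they depend on the machine and compiler version.

```crystal
require "benchmark"

macro compiletime(regex)
  /foo(?-imsx:{{regex.source.id}})/
end

BAR    = /bar/
CACHED = /foo#{BAR}/ # runtime interpolation, done once for the constant

text = "foobar"
Benchmark.ips do |x|
  # Only the matching cost remains in both cases.
  x.report("cached runtime interpolation + text matching") { CACHED =~ text }
  x.report("compiletime interpolation + text matching") { compiletime(/bar/) =~ text }
end
```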

The benefit of compile-time interpolation would still be that the regex is cached automatically (without you even knowing it) instead of having to do it manually, but that’s marginal.