Regex that is multiline but not dotall, how?

It seems that to match some behaviour from golang I need a regex that is multiline but not dotall.

Specifically, I need

/#.*$/ to only match the first line in “# comment\nnotacomment\n”

If it’s not multiline, it matches nothing.
Because multiline in Crystal implies dotall, /#.*$/m matches both lines:

icr:1> /#.*$/m.match "#comment\nnotacomment"
 => Regex::MatchData("#comment\nnotacomment")
icr:2> /#.*$/.match "#comment\nnotacomment"
 => nil

In other languages it’s possible. In Python, this does what I need:

>>> re.match("#.*$", "#comment\nnotacomment", re.MULTILINE) 
<re.Match object; span=(0, 8), match='#comment'>
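
To make the difference concrete, here is a minimal sketch in Python's `re` module, where the two flags are independent. It illustrates only the behaviour being asked for and says nothing about Crystal's internals.

```python
import re

text = "#comment\nnotacomment"

# MULTILINE only: `$` can match before a newline, `.` still stops at "\n".
multiline_only = re.match(r"#.*$", text, re.MULTILINE)

# MULTILINE | DOTALL: `.` also consumes "\n", so the greedy `.*` runs to the end.
multiline_dotall = re.match(r"#.*$", text, re.MULTILINE | re.DOTALL)

# No flags: `.` stops at "\n" and `$` cannot match in mid-string, so no match.
no_flags = re.match(r"#.*$", text)

print(multiline_only.group(0))
print(multiline_dotall.group(0))
print(no_flags)
```

The second case is what Crystal's /m gives today; the first is the combination the lexer definitions expect.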

If you can help me get this working, we all get a very nice port of Pygments to Crystal in a day or two :-D


In the unlikely case someone else needs this very specific thing, here’s a workaround which seems to work and lets me use the forbidden flag combos:

class Re2 < Regex
  @source = "fa"
  @options = Regex::Options::None
  @jit = true

  def initialize(pattern : String, multiline = false, dotall = false, ignorecase = false)
    flags = LibPCRE2::UTF | LibPCRE2::DUPNAMES | LibPCRE2::UCP
    flags |= LibPCRE2::MULTILINE if multiline
    flags |= LibPCRE2::DOTALL if dotall
    flags |= LibPCRE2::CASELESS if ignorecase
    @re = Regex::PCRE2.compile(pattern, flags) do |error_message|
      raise Exception.new(error_message)
    end
  end
end

p! Re2.new("#.*$", multiline: true, dotall: false).match("#comm\nnotcomm")


I think you could just use the regex #.* without any modifiers. Given that . excludes line breaks by default, it’ll just match up to the first newline.
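
A quick sketch of that suggestion, again in Python's `re` because its default `.` behaves the same way here. It also shows why the original pattern fails without multiline: the trailing `$` cannot match in the middle of the string.

```python
import re

text = "#comment\nnotacomment"

# Without `$` and without flags, `.*` simply stops at the first newline.
without_anchor = re.match(r"#.*", text)

# With `$` but no MULTILINE, the anchor cannot match in mid-string.
with_anchor = re.match(r"#.*$", text)

print(without_anchor.group(0))
print(with_anchor)
```

This only helps when the pattern itself can be edited.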

I can’t change the regexes, they are 2MB of data files from another project describing language lexers.

I feel like there’s something in your question that isn’t clear. You say you can’t change the regexes, but you asked for a regex. On the surface, those two things sound mutually exclusive. Can you clarify what you do need?

Are you looking for ways to shoehorn that specific regex to match the first line of that string?

I have a few hundred XML files that describe grammars. One of the ways in which they do it is by listing regular expressions, such as

      <rule pattern="^([ \t\f]*)([#!].*)">
        <bygroups>
          <token type="Text"/>
          <token type="CommentSingle"/>
        </bygroups>
      </rule>

I can’t change those regular expressions because there are roughly 8000 of them. What my code does is parse those XML files and create Crystal objects which are used to parse text.

As part of parsing I do something like this:

pattern = Regex.new(node["pattern"],
                    Regex::Options::ANCHORED | Regex::Options::MULTILINE)

What I want is a Crystal Regex object that has the MULTILINE option but not the DOTALL option enabled, because that object behaves the way the parsers need it to behave.
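
For reference, a minimal Python sketch of the same pipeline: read a rule from XML, compile its pattern unchanged with multiline but not dotall, and pair the capture groups with the token types. The helper names `compile_rule` and `token_types` are made up for illustration and come from neither Pygments nor the Crystal port.

```python
import re
import xml.etree.ElementTree as ET

RULE_XML = r"""
<rule pattern="^([ \t\f]*)([#!].*)">
  <bygroups>
    <token type="Text"/>
    <token type="CommentSingle"/>
  </bygroups>
</rule>
""".strip()


def compile_rule(node):
    """Compile the rule's pattern as-is, with MULTILINE but not DOTALL."""
    return re.compile(node.get("pattern"), re.MULTILINE)


def token_types(node):
    """Return the token type names listed under <bygroups>, in order."""
    return [token.get("type") for token in node.find("bygroups")]


rule = ET.fromstring(RULE_XML)
pattern = compile_rule(rule)

text = "  # comment\nnotacomment\n"
# pattern.match(text, pos) only matches starting at pos, which plays the
# role of the ANCHORED option here.
match = pattern.match(text, 0)
emitted = list(zip(token_types(rule), match.groups()))
print(emitted)
```

With dotall also enabled, the second group would swallow the following lines as well, which is exactly the behaviour to avoid.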

Hope that’s clearer.

Yeah, I don’t think there’s a way to do this at the moment. Even if you use the underlying value for MULTILINE, the compiler will prevent it: see src/regex/pcre2.cr in crystal-lang/crystal at commit 405f313e071aaaa35c824632dfdbfc8fdc2a658b.

It would probably take a new option, something like STRICT_MULTILINE, that maps only to the PCRE2 MULTILINE option, given that MULTILINE is already taken.

For now I am running a hacked-up Crystal compiler. I’ll try to find a way to monkeypatch it, but if this option could exist in a future version, it would be nice (I may propose a PR I guess :-) )

Well, couldn’t your parser rewrite the regex, so . is replaced by <something that matches anything but newline>? Does it need to use the exact same regex?
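
To illustrate the trade-off in Python: a naive textual replacement of `.` silently corrupts escaped dots, while wrapping the whole pattern in a scoped inline flag group leaves the pattern text alone. The helper names are hypothetical, and whether the wrapping trick carries over to Crystal's PCRE2-backed Regex is an untested assumption.

```python
import re


def naive_rewrite(pattern):
    """Replace every '.' textually. This also hits escaped dots like '\\.'."""
    return pattern.replace(".", r"[^\n]")


def without_dotall(pattern):
    """Wrap the pattern in a non-capturing group that switches dotall off."""
    return f"(?-s:{pattern})"


# The naive rewrite turns the literal dot in a float pattern into garbage
# that still compiles but no longer matches.
float_pattern = r"\d+\.\d+"
broken = re.compile(naive_rewrite(float_pattern))
print(broken.match("3.14"))

# The wrapper keeps the original text intact, so group numbers are unchanged,
# and `.` stops at newlines even though DOTALL is set on the outside.
wrapped = re.compile(without_dotall(r"#.*$"), re.MULTILINE | re.DOTALL)
print(wrapped.match("#comment\nnotacomment").group(0))
```

A correct rewrite of `.` would need a real regex parser to skip escapes and character classes, which is where the risk lies across thousands of patterns.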

The odds of my code rewriting random subsets of the 8000 regexes and not breaking things is zero :-D

I have found workarounds to set the right flag combination, it’s just that Crystal’s refusal to set a number to 4 (MULTILINE) rather than 6 (MULTILINE | DOTALL) is a bit frustrating :rofl:

And the reason for this is…

Which is a port of Pygments, the Python syntax highlighting library.

Currently only the lexing side is implemented, but that’s the hard part :-D

I think it should be possible to enable the MULTILINE option alone without DOTALL. The current API doesn’t allow that though. The primary issue is that PCRE and PCRE2 use different values, so we cannot pass the flags directly and instead need some translation. This means it’s not possible to just use a custom value and we need to add explicit code for this.

As a minor inconvenience, we’re carrying legacy baggage in naming because CompileOptions::MULTILINE means MULTILINE | DOTALL.
So a constant for only MULTILINE needs to be called something like MULTILINE_ONLY. :person_shrugging:
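
A rough sketch of the translation idea, written in Python for brevity. The `Options` enum is hypothetical and is not Crystal's actual definition. The native flag values are recalled from pcre.h and pcre2.h and should be checked against the headers before relying on them.

```python
from enum import IntFlag


class Options(IntFlag):
    """Hypothetical backend-neutral options (not Crystal's real enum)."""
    IGNORE_CASE = 1
    DOTALL = 2
    MULTILINE_ONLY = 4
    # Legacy name: keeps meaning "multiline and dotall" for compatibility.
    MULTILINE = MULTILINE_ONLY | DOTALL


# Per-backend native values (assumed; verify against pcre.h / pcre2.h).
PCRE_FLAGS = {
    Options.IGNORE_CASE: 0x0001,
    Options.MULTILINE_ONLY: 0x0002,
    Options.DOTALL: 0x0004,
}
PCRE2_FLAGS = {
    Options.IGNORE_CASE: 0x0008,
    Options.MULTILINE_ONLY: 0x0400,
    Options.DOTALL: 0x0020,
}


def translate(options, table):
    """Map backend-neutral options to one backend's native bit mask."""
    native = 0
    for option, flag in table.items():
        if options & option:
            native |= flag
    return native


print(hex(translate(Options.MULTILINE, PCRE_FLAGS)))
print(hex(translate(Options.MULTILINE, PCRE2_FLAGS)))
print(hex(translate(Options.MULTILINE_ONLY, PCRE2_FLAGS)))
```

Because the two backends assign different bits to the same option, a caller cannot pass a raw number through; each new option needs an explicit entry in every table.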


Cool, I will try to do a PR this week