The Crystal Programming Language Forum

Surprising behavior of StringScanner#scan with \A anchor

I find the difference between matching with and without the \A anchor surprising:

require "string_scanner"

str = "hello world"

# Regex#match
substr = str[6..-1] # "world"
p! str.match(/world/)      # => Regex::MatchData("world")
p! substr.match(/\Aworld/) # => Regex::MatchData("world")

# StringScanner#scan with offset
scanner = StringScanner.new(str)
scanner.offset = 6
p! scanner.scan(/world/)   # => "world"

scanner = StringScanner.new(str)
scanner.offset = 6
# I expect it to work the same as if it scans substring "world"
p! scanner.scan(/\Aworld/) # => nil

At first I thought that it might be a bug in StringScanner. Then I found out that Regex#match_at_byte_index behaves the same:

str = "hello world"

p! /world/.match_at_byte_index(str, 6)   # => Regex::MatchData("world")
p! /\Aworld/.match_at_byte_index(str, 6) # => nil

And then I verified that in Ruby it’s the same too:

p /world/.match("hello world", 6)   # => #<MatchData "world">
p /\Aworld/.match("hello world", 6) # => nil

I wonder why it is like that. Maybe there is some reason, and with Regex#match_at_byte_index I can probably get used to it, but it’s still surprising.

But with StringScanner the offset may not even be visible when I work with it. I would call scan sequentially and expect it to work on the substring str[offset..-1], ignoring everything before offset.

Isn’t the current behavior of at least StringScanner in Crystal surprising?

In Ruby, StringScanner works as you expected:

require "strscan"

str = "hello world"
scanner = StringScanner.new(str)
scanner.pos = 6
scanner.scan /\Aworld/ # => "world"

So I suppose this is probably a bug in our implementation. I’m not familiar with StringScanner at all, though.

\A means “beginning of string”, and “world” doesn’t happen at the beginning of the string. So, if anything, I think this is incorrect behavior in Ruby. Maybe they can’t fix it at this point.

Oh, well, I guess if you pass a position then that is the beginning that should be considered. Please open a bug report. Thanks!

Thanks for pointing it out. How come I didn’t try it in Ruby myself first?

I wonder why, in Ruby, Regexp#match with a start position does not produce the same result as StringScanner#scan, though. Could it be legacy behavior?

require "strscan"

str = "hello world"
scanner = StringScanner.new(str)
scanner.pos = 6
scanner.scan /\Aworld/ # => "world"

# Why was this chosen in Ruby?
/\Aworld/.match(str, 6) # => nil

I might have time to look into it more over the weekend and try to find the reasoning behind the difference between StringScanner#scan and Regexp#match with a starting position in Ruby. Then, hopefully, I can come up with a fix for Crystal’s StringScanner#scan, if we are to match the Ruby behavior.

I would ask in the Ruby forums. It’s probably legacy behavior.

To get a quick response from Ruby devs, it may be better to frame the question as a Ruby issue and ask here:

So I opened a bug in the Ruby tracker about the different behavior of Regexp#match(str, position) and StringScanner#scan when using \A or ^, but it was rejected: https://bugs.ruby-lang.org/issues/18471?tab=history#note-1

It looks like they are not 100% sure and are speculating about why it is like that. Interestingly, they see the StringScanner#scan behavior in Ruby as the potentially confusing one (it matches with \A), and not the other way around as it is for me (Regexp#match with a starting position does not match with \A).

I think changing this difference in Ruby would not be possible, even if it were considered incorrect.

Also this one works in Ruby (unlike \A):

str = "hello world"
/\Gworld/.match(str, 6) # => #<MatchData "world">

I found this description of \G for the PCRE regex engine in the PHP docs:

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero.

So it looks like Ruby uses the same logic: \A means the start of the whole string, and \G means the start point of the match, which is specified by the second argument of Regexp#match.
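To make the \A versus \G distinction concrete, here is a minimal Ruby sketch that uses only Regexp#match with a start position. The variable names and the small tokenizer loop are my own illustration (assumptions, not taken from the Ruby docs); the loop shows how \G can pin each match to the current cursor, similar to what one might expect from sequential scanning.

```ruby
str = "hello world"

# \A anchors to the start of the whole string, regardless of the start position.
at_start = /\Aworld/.match(str, 6)
# \G anchors to the start position passed as the second argument.
at_cursor = /\Gworld/.match(str, 6)
# A real substring has its own start, so \A matches there.
on_slice = /\Aworld/.match(str[6..-1])

p at_start  # => nil
p at_cursor # => #<MatchData "world">
p on_slice  # => #<MatchData "world">

# Hypothetical tokenizer: \G forces every token to begin exactly at pos,
# so nothing is skipped between tokens.
token = /\G(?:\w+|\s+)/
tokens = []
pos = 0
while pos < str.length && (m = token.match(str, pos))
  tokens << m[0]
  pos = m.end(0) # character offset just past the matched token
end

p tokens # => ["hello", " ", "world"]
```

If a character matches neither alternative, the loop simply stops at that position, which is probably what you want from an anchored scan.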

In Crystal, StringScanner#scan and Regex#match with a starting position behave the same (\G matches when there is a non-zero offset and \A does not):

require "string_scanner"

str = "hello world"

p! str.match(/world/, 6)      # => Regex::MatchData("world")
p! str.match(/\Aworld/, 6)    # => nil
p! str.match(/\Gworld/, 6)    # => Regex::MatchData("world")

scanner = StringScanner.new(str)
scanner.offset = 6
p! scanner.scan(/world/)   # => "world"

scanner = StringScanner.new(str)
scanner.offset = 6
p! scanner.scan(/\Aworld/) # => nil

scanner = StringScanner.new(str)
scanner.offset = 6
p! scanner.scan(/\Gworld/) # => "world"

p! /world/.match_at_byte_index(str, 6)   # => Regex::MatchData("world")
p! /\Aworld/.match_at_byte_index(str, 6) # => nil
p! /\Gworld/.match_at_byte_index(str, 6) # => Regex::MatchData("world")

Now the question is: how should StringScanner#scan behave in Crystal?

Should it replicate the Ruby difference or keep the current behavior?

In Ruby, the difference in behavior between StringScanner and Regexp with a starting position and \A or ^ anchors is considered expected, with this reasoning:

Moving “the position in the string to begin the search” does not mean the string will be truncated. It only moves the cursor position in the same string. It does not create a new string or a new edge of a string.

On the other hand, I find the behavior of StringScanner#scan to be potentially confusing, but I think I can understand why it is designed so. To my understanding, as you read from a StringScanner instance, the matched part is consumed, the original content is truncated, and indeed new edges are created as you read from it. (Saying somewhat metaphorically. I do not know about the actual implementation.)

Thus, I think String and StringScanner differ in nature. And particularly, String#match is not destructive whereas StringScanner#scan is destructive.
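The “moved cursor” versus “new edge” reasoning can be observed from plain Ruby. This is only a sketch: the lookbehind pattern is my own illustration and is not part of the Ruby tracker discussion. It shows that Regexp#match with a start position still sees the text before that position, while a real substring does not, and that StringScanner#scan treats the scan pointer as the place where \A matches.

```ruby
require "strscan"

str = "hello world"
lookbehind = /(?<=hello )world/

# Regexp#match with a start position only moves the cursor: the text before
# position 6 is still part of the subject, so the lookbehind can see it.
with_cursor = lookbehind.match(str, 6)
# A real substring has a new left edge, so there is nothing to look back at.
with_slice = lookbehind.match(str[6..-1])

p with_cursor # => #<MatchData "world">
p with_slice  # => nil

# StringScanner consumes input as it goes; after "hello " is consumed,
# \A matches at the scan pointer, as shown earlier in this thread.
scanner = StringScanner.new(str)
consumed = scanner.scan(/hello /)
anchored = scanner.scan(/\Aworld/)

p consumed     # => "hello "
p anchored     # => "world"
p scanner.eos? # => true
```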