Regex counts certain characters wrong. How to fix?

Consider this code:

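# Assumed value: the original definition of good_header is missing; any 20-character ASCII string works here.
good_header = "ABCDEFGHIJKLMNOPQRST"
bad_header = "ABCDEFGHIJKLMNOP\x93\x92JO"
puts "Header Sizes: #{good_header.size}, #{bad_header.size}"
good_input = good_header + "!!!!!!!!!!"
bad_input = bad_header + "!!!!!!!!!!"
# Skip the first 20 characters, then capture everything after them.
regex = /(?<=.{20})(.+)/

m = good_input.match(regex)
if m
  pp "Good match : " + m[0]
else
  puts "No match!"
end

m = bad_input.match(regex)
if m
  pp "Bad match  : " + m[0]
else
  puts "No match!"
end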


Header Sizes: 20, 20
"Good match : !!!!!!!!!!"
"Bad match  : !!!!!!!!"

The headers are exactly 20 bytes long, and Crystal sees them that way. But when I plug in a regex that skips the first 20 chars and captures everything after, the regex ignores the “\x93\x92” and captures the wrong range (it counts the first two ! of the body as part of the header, so they go missing from the capture).

Anyone know what I can do to get regex to count those characters?

The issue here is that \x93 and \x92 are not valid UTF-8 byte sequences. Crystal strings allow invalid UTF-8 bytes, but the regex engine (pcre) probably just skips them and treats them as if they didn’t exist, so they’re not counted in the regex. This makes some kind of sense because invalid UTF-8 bytes don’t represent a character.
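A minimal sketch of how to check whether a string is affected, assuming the same header as in the question (good_header is a made-up 20-character value, since the original one isn’t shown):

```crystal
# bad_header is copied from the question; good_header is an assumed stand-in.
bad_header = "ABCDEFGHIJKLMNOP\x93\x92JO"
good_header = "ABCDEFGHIJKLMNOPQRST"

# Both strings occupy 20 bytes in memory.
puts bad_header.bytesize
puts good_header.bytesize

# Only the ASCII-only header is valid UTF-8; \x93 and \x92 are
# continuation bytes that are not preceded by a lead byte.
puts good_header.valid_encoding?
puts bad_header.valid_encoding?
```

If valid_encoding? returns false, character counts inside a regex may not line up with what String#size reports.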

In order to use strings with regular expressions, you need to make sure they’re valid UTF-8 strings. Crystal currently does not enforce valid UTF-8 automatically, but this may change in the future.
In the meantime, you can use String#scrub to replace invalid byte sequences with a replacement character.

bad_input.scrub.match(regex).try &.[0] # => "!!!!!!!!!!"

Unfortunately scrub won’t work for what I need. I was hoping to select bytes this way, but it looks like I’ll need to think of another solution.
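To illustrate why scrub is a poor fit for byte-level work, here is a small sketch (using the header from the question) showing that scrubbing rewrites the underlying bytes instead of preserving them:

```crystal
bad_header = "ABCDEFGHIJKLMNOP\x93\x92JO"
scrubbed = bad_header.scrub

# The scrubbed copy is valid UTF-8...
puts scrubbed.valid_encoding?

# ...but only because the stray bytes were swapped for U+FFFD,
# which takes several bytes in UTF-8, so the byte length grows.
puts bad_header.bytesize
puts scrubbed.bytesize

# The original string still holds the raw byte 0x93 at byte index 16.
puts bad_header.to_slice[16] == 0x93
puts scrubbed.bytes == bad_header.bytes
```

So a regex run on the scrubbed string counts characters as expected, but any byte offsets it yields refer to the scrubbed copy, not to the original input.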

I think there’s a flag in pcre to control whether UTF-8 is used. If this is a valid use case we might want to expose it somehow. Or, well, you can always create a Regex instance with new and configure it (you’ll have to check the source code in the standard library).


Will do! I would love to use it. I’m trying to use it in my fuzzing library, Crowbar. I figured it might be easy to select which text to mutate using regex.

Actually, the UTF-8 option is hardcoded into the regex options, so there’s no way to do it without modifying Regex.

But you can do this (I don’t like it but it works):


This is a good workaround for me, but it might not work for all regexes that get pumped into Crowbar. Either way, it might be a good idea to expose this behavior as a regex flag somehow. If you think that’s a good idea I can write up an issue on GitHub. I was thinking it could work like an extra flag/modifier, like case-insensitive i, global g, or multiline m. Maybe capital B for Byte. Example: /(?<=.{20})(.+)/B or /(?B)(?<=.{20})(.+)/. Although I’m sure you guys don’t want to deviate from the regex standard too much. What do you think?

I don’t know, there are some issues related to regex and non-utf8, for example (maybe it’s not related). I can’t decide on this because I’m not super familiar with regex over bytes. I wonder how it works in Ruby.

I tried it in Ruby and it complains about invalid UTF-8.
I tried messing around with the encoding, but nothing I did worked.
I also tried the n flag, but it didn’t work.

It works if you change the String encoding to be a slice:

So maybe we could make a Regex match a slice of bytes… not sure how yet.

(edit: wow, apparently if you put the link in a separate line it embeds it here! :tada:)


Very cool! I actually would love that functionality because a lot of the work I do is at the byte level (lots of fuzzing and byte-level manipulation). For now I came up with a solution for Crowbar: a Header selector that selects only the first x bytes using plain Crystal instead of a regex.
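For anyone landing here later, a rough sketch of what such a byte-based selector could look like. The split_header helper is hypothetical and is not Crowbar’s actual API; it only shows the idea of slicing by bytes rather than by characters:

```crystal
# Hypothetical helper (not part of Crowbar): splits an input into the
# first `header_bytes` bytes and everything after, without decoding UTF-8.
def split_header(input : String, header_bytes : Int32) : Tuple(Bytes, Bytes)
  bytes = input.to_slice
  # Clamp so inputs shorter than the header don't raise.
  n = Math.min(header_bytes, bytes.size)
  {bytes[0, n], bytes[n, bytes.size - n]}
end

bad_input = "ABCDEFGHIJKLMNOP\x93\x92JO" + "!!!!!!!!!!"
header, body = split_header(bad_input, 20)

puts header.size      # 20, including the two invalid bytes
puts String.new(body) # !!!!!!!!!!
```

Because the slices are views over the raw bytes, the invalid \x93\x92 pair is counted like any other two bytes, which is what the regex approach could not do.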

(also didn’t know about the repl thing. Super cool!)

Given that String might eventually become UTF-8-only, we could consider adding a regex method to Slice. But then it’s another byte-specific method in the general-purpose Slice type…
Anyway, I guess we should continue this discussion on the issue tracker. There seems to be a use case (I’m not sure how common it is, though).

One more thing to add: the fact that a Regex can handle utf-8 or raw bytes is defined in its constructor. So it’s a bit hard right now for a regex to handle both cases.

If this isn’t a very common case maybe non-utf-8 regexes can be handled by a shard.
