Fair critics about Crystal's Char type

asterite · July 17, 2022, 10:42am

See Open Source Adventures: Episode 75: Issues with Crystal Char type - DEV Community

My summary:

The name Char is misleading because it implies a character, but in reality it’s just a codepoint
Char can’t represent graphemes. Maybe Char should have been Grapheme instead.
Comparison between Char and String always gives false
Matching a Char against a regex always gives nil

That said, I’m a bit reluctant about the part “Most modern languages don’t have a Char type”:

Rust has one, which is the same as Crystal
Golang has a rune type, which is equivalent to Crystal’s Char
Nim’s char is a byte
Swift has a Character type and it’s like a grapheme

So what are all these modern programming languages that don’t have a Char type?

If Crystal 1.0 didn’t release yet, I would consider renaming Char to Codepoint, and making all String operations return String or Grapheme, but I think now it’s a bit too late because of backwards compatibility.

So some questions:

Should we let matching a Char against a Regex?
Should we add upcase and downcase to Grapheme? I noticed they don’t exit yet
Any other thoughts you have on this?

asterite · July 17, 2022, 11:04am

I forgot to mention: Elixir works very well regarding strings, and there’s ko Char type. But so far the only languages without a Char or with one that has a default Grapheme type are Elixir and Swift. Maybe those are the only modern languages, so “most modern languages” would be accurate, I guess :-)

stakach · July 17, 2022, 9:06pm

I’ve personally not had any issues using Chars and think they are important for performance.
If I do "example,usage".split(',') I don’t need a string class being allocated to the heap for the ',' param.

From my reading of the article the main complaint really boils down to String#[] returning a Char type and not a string of length 1. Which seems like a reasonable change to me.

We could remove Char#upcase or have it return a String as I feel like accuracy here is more important than maintaining the type (make it a shortcut for converting the char to a string and then calling upcase on a string)

I don’t mind if "a" == 'a' is true, but current behaviour doesn’t bother me.

hugopl · July 18, 2022, 4:14am

I didn’t even know that there are languages that a single character in upcase becomes 2 characters

"ß".upcase # => "SS"

Blow Mind Mind Blown GIF - Blow Mind Mind Blown Explode - Discover & Share GIFs

stakach · July 18, 2022, 4:51am

it’s a pretty new addition to the language (2017)

yxhuvud · July 18, 2022, 9:19am

From my reading of the article the main complaint really boils down to String#[] returning a Char type and not a string of length 1. Which seems like a reasonable change to me.

Access to the old variant will still be needed though, there are many situations where the performance impact of creating a bazillion one character strings would be unwanted.

Should we let matching a Char against a Regex?

Well, what are the implications? Does the char have to be converted (and thus allocated) to a string to do it or is it possible to do without further overhead?

I guess part of the problem also is how to handle regexps defined with the case insensitive modifier. Does the answer to the former impact that?

I don’t really have any opinion on == status for Char vs String

asterite · July 18, 2022, 11:34am

Makes sense!

Just a note that when you write a string literal like “,” there’s never memory allocation. The string data is put into the read-only-memory of the program.

That said, if you called "hello".chars and we’d have to allocate one String per char, that would incur a lot of memory allocations (unless we also have a way to represent small strings in an efficient way, but that makes things more complex)

HertzDevil · July 18, 2022, 12:42pm

You can force #[] to return a string by doing str[index, 1]. This does involve an allocation though.

You could see all the special characters that upcase/downcase to multiple codepoints:

module Unicode
  def self.show_all(cases)
    cases.to_a.sort_by(&.first).each do |k, v|
      puts "%s (U+%04X) => %s (%s)" % {
        k.chr,
        k,
        v.map { |c| c == 0 ? "" : c.chr }.join,
        v.compact_map { |c| c == 0 ? nil : "U+%04X" % c }.join(' '),
      }
    end
  end

  puts "upcase:"
  show_all(special_cases_upcase)
  puts "downcase:"
  show_all(special_cases_downcase)
end

upcase:
ß (U+00DF) => SS (U+0053 U+0053)
ŉ (U+0149) => ʼN (U+02BC U+004E)
ǰ (U+01F0) => J̌ (U+004A U+030C)
ΐ (U+0390) => Ϊ́ (U+0399 U+0308 U+0301)
ΰ (U+03B0) => Ϋ́ (U+03A5 U+0308 U+0301)
և (U+0587) => ԵՒ (U+0535 U+0552)
ẖ (U+1E96) => H̱ (U+0048 U+0331)
ẗ (U+1E97) => T̈ (U+0054 U+0308)
ẘ (U+1E98) => W̊ (U+0057 U+030A)
ẙ (U+1E99) => Y̊ (U+0059 U+030A)
ẚ (U+1E9A) => Aʾ (U+0041 U+02BE)
ὐ (U+1F50) => Υ̓ (U+03A5 U+0313)
ὒ (U+1F52) => Υ̓̀ (U+03A5 U+0313 U+0300)
ὔ (U+1F54) => Υ̓́ (U+03A5 U+0313 U+0301)
ὖ (U+1F56) => Υ̓͂ (U+03A5 U+0313 U+0342)
ᾀ (U+1F80) => ἈΙ (U+1F08 U+0399)
ᾁ (U+1F81) => ἉΙ (U+1F09 U+0399)
ᾂ (U+1F82) => ἊΙ (U+1F0A U+0399)
ᾃ (U+1F83) => ἋΙ (U+1F0B U+0399)
ᾄ (U+1F84) => ἌΙ (U+1F0C U+0399)
ᾅ (U+1F85) => ἍΙ (U+1F0D U+0399)
ᾆ (U+1F86) => ἎΙ (U+1F0E U+0399)
ᾇ (U+1F87) => ἏΙ (U+1F0F U+0399)
ᾈ (U+1F88) => ἈΙ (U+1F08 U+0399)
ᾉ (U+1F89) => ἉΙ (U+1F09 U+0399)
ᾊ (U+1F8A) => ἊΙ (U+1F0A U+0399)
ᾋ (U+1F8B) => ἋΙ (U+1F0B U+0399)
ᾌ (U+1F8C) => ἌΙ (U+1F0C U+0399)
ᾍ (U+1F8D) => ἍΙ (U+1F0D U+0399)
ᾎ (U+1F8E) => ἎΙ (U+1F0E U+0399)
ᾏ (U+1F8F) => ἏΙ (U+1F0F U+0399)
ᾐ (U+1F90) => ἨΙ (U+1F28 U+0399)
ᾑ (U+1F91) => ἩΙ (U+1F29 U+0399)
ᾒ (U+1F92) => ἪΙ (U+1F2A U+0399)
ᾓ (U+1F93) => ἫΙ (U+1F2B U+0399)
ᾔ (U+1F94) => ἬΙ (U+1F2C U+0399)
ᾕ (U+1F95) => ἭΙ (U+1F2D U+0399)
ᾖ (U+1F96) => ἮΙ (U+1F2E U+0399)
ᾗ (U+1F97) => ἯΙ (U+1F2F U+0399)
ᾘ (U+1F98) => ἨΙ (U+1F28 U+0399)
ᾙ (U+1F99) => ἩΙ (U+1F29 U+0399)
ᾚ (U+1F9A) => ἪΙ (U+1F2A U+0399)
ᾛ (U+1F9B) => ἫΙ (U+1F2B U+0399)
ᾜ (U+1F9C) => ἬΙ (U+1F2C U+0399)
ᾝ (U+1F9D) => ἭΙ (U+1F2D U+0399)
ᾞ (U+1F9E) => ἮΙ (U+1F2E U+0399)
ᾟ (U+1F9F) => ἯΙ (U+1F2F U+0399)
ᾠ (U+1FA0) => ὨΙ (U+1F68 U+0399)
ᾡ (U+1FA1) => ὩΙ (U+1F69 U+0399)
ᾢ (U+1FA2) => ὪΙ (U+1F6A U+0399)
ᾣ (U+1FA3) => ὫΙ (U+1F6B U+0399)
ᾤ (U+1FA4) => ὬΙ (U+1F6C U+0399)
ᾥ (U+1FA5) => ὭΙ (U+1F6D U+0399)
ᾦ (U+1FA6) => ὮΙ (U+1F6E U+0399)
ᾧ (U+1FA7) => ὯΙ (U+1F6F U+0399)
ᾨ (U+1FA8) => ὨΙ (U+1F68 U+0399)
ᾩ (U+1FA9) => ὩΙ (U+1F69 U+0399)
ᾪ (U+1FAA) => ὪΙ (U+1F6A U+0399)
ᾫ (U+1FAB) => ὫΙ (U+1F6B U+0399)
ᾬ (U+1FAC) => ὬΙ (U+1F6C U+0399)
ᾭ (U+1FAD) => ὭΙ (U+1F6D U+0399)
ᾮ (U+1FAE) => ὮΙ (U+1F6E U+0399)
ᾯ (U+1FAF) => ὯΙ (U+1F6F U+0399)
ᾲ (U+1FB2) => ᾺΙ (U+1FBA U+0399)
ᾳ (U+1FB3) => ΑΙ (U+0391 U+0399)
ᾴ (U+1FB4) => ΆΙ (U+0386 U+0399)
ᾶ (U+1FB6) => Α͂ (U+0391 U+0342)
ᾷ (U+1FB7) => Α͂Ι (U+0391 U+0342 U+0399)
ᾼ (U+1FBC) => ΑΙ (U+0391 U+0399)
ῂ (U+1FC2) => ῊΙ (U+1FCA U+0399)
ῃ (U+1FC3) => ΗΙ (U+0397 U+0399)
ῄ (U+1FC4) => ΉΙ (U+0389 U+0399)
ῆ (U+1FC6) => Η͂ (U+0397 U+0342)
ῇ (U+1FC7) => Η͂Ι (U+0397 U+0342 U+0399)
ῌ (U+1FCC) => ΗΙ (U+0397 U+0399)
ῒ (U+1FD2) => Ϊ̀ (U+0399 U+0308 U+0300)
ΐ (U+1FD3) => Ϊ́ (U+0399 U+0308 U+0301)
ῖ (U+1FD6) => Ι͂ (U+0399 U+0342)
ῗ (U+1FD7) => Ϊ͂ (U+0399 U+0308 U+0342)
ῢ (U+1FE2) => Ϋ̀ (U+03A5 U+0308 U+0300)
ΰ (U+1FE3) => Ϋ́ (U+03A5 U+0308 U+0301)
ῤ (U+1FE4) => Ρ̓ (U+03A1 U+0313)
ῦ (U+1FE6) => Υ͂ (U+03A5 U+0342)
ῧ (U+1FE7) => Ϋ͂ (U+03A5 U+0308 U+0342)
ῲ (U+1FF2) => ῺΙ (U+1FFA U+0399)
ῳ (U+1FF3) => ΩΙ (U+03A9 U+0399)
ῴ (U+1FF4) => ΏΙ (U+038F U+0399)
ῶ (U+1FF6) => Ω͂ (U+03A9 U+0342)
ῷ (U+1FF7) => Ω͂Ι (U+03A9 U+0342 U+0399)
ῼ (U+1FFC) => ΩΙ (U+03A9 U+0399)
ﬀ (U+FB00) => FF (U+0046 U+0046)
ﬁ (U+FB01) => FI (U+0046 U+0049)
ﬂ (U+FB02) => FL (U+0046 U+004C)
ﬃ (U+FB03) => FFI (U+0046 U+0046 U+0049)
ﬄ (U+FB04) => FFL (U+0046 U+0046 U+004C)
ﬅ (U+FB05) => ST (U+0053 U+0054)
ﬆ (U+FB06) => ST (U+0053 U+0054)
ﬓ (U+FB13) => ՄՆ (U+0544 U+0546)
ﬔ (U+FB14) => ՄԵ (U+0544 U+0535)
ﬕ (U+FB15) => ՄԻ (U+0544 U+053B)
ﬖ (U+FB16) => ՎՆ (U+054E U+0546)
ﬗ (U+FB17) => ՄԽ (U+0544 U+053D)
downcase:
İ (U+0130) => i̇ (U+0069 U+0307)

straight-shoota · July 20, 2022, 7:21am

Yes, there are some valid points for criticism. Although I think the blog post might at times be a bit dramatic about it.

In fact, I’ve wondered about the purpose and place of Char while working on the Grapheme API. There is certainly some overlap, and potentially cause for confusion.

I agree that a name such as Codepoint would’ve been a better choice. It would clearly differentiate it from the broader scoped grapheme cluster (sequence of codepoints) as well as the tighter scoped C-style char (single byte).
At this point, a rename would be quite an effort. Hypothetically, we could slowly phase it in as a type alias and automatically transform code to use the updated name. Not sure that’s worth it. Might be best to just embrace the name as it is. It’s not a hard problem, you just need to be conscious about the semantics.
Char is at least shorter

I would strongly refute the argument that Char is useless and shouldn’t have been part of the stdlib API in the first place. It’s very efficient due to the lack of heap allocations. So it provides performance for text processing based on single codepoints (which is often the case in computer languages, for example).
And it’s clearly defined what a codepoint is. Grapheme clusters for example are more fuzzy, because the definition can change with Unicode releases (probably not much and mostly exotic edge cases, but still).

Char might be a bit too prominently represented in the string API, though. That’s not just the Char type itself, but also the default index of String (e.g. for String#[]) is a codepoint index, not a byte or grapheme index. This might not be ideal as it guides the user to use that representation, while others (especially grapheme cluster) might be more appropriate in general use cases.

Grapheme is probably a better default model because it more accurately represents what you would normally expect in most text processing contexts. Using only codepoints or bytes is a performance optimization and you need to be aware of the implications it has for your application.

Perhaps we can try to adjust the string API a little bit more towards prefering grapheme in the future. At least conceptually / in the documentation. For that we also need to expand grapheme support which is still pretty basic for now. Adding upcase and downcase would help that.

straight-shoota · July 20, 2022, 7:32am

I suppose that should be okay. But I’m actually not sure how useful it is to have regular expressions for matching only a single character. You’ll most likely have that be some kind of character class, which you can much more efficiently match with Char’s predicate methods, direct codepoint comparisons or range expressions.
The article uses the example /[0-9]/ to match for a digit. You can just use Char#ascii_number? for that. If a dedicated method didn’t exist, you could use '0' <= c <= '9' or c.in?('0'..'9') as well. All these options are much more efficient than spinning up a regular exception engine.

I fear that adding regex support to Char would do more harm than good, as I don’t see many valid use cases and it would guide users away from better alternatives.
If you actually want to do that, you can just convert the character to a string and use that with a regular expression.

straight-shoota · July 20, 2022, 7:36am

I wouldn’t mind to enable equality check between Char and String.
Perhaps that would be something for the case equality operator (===)? There is already Char#===(Int) which works with a codepoint number and thus is “type insensitive”.

zw963 · July 20, 2022, 4:09pm

Char and String, it totally same from the angle of developer (ignore performance impact), but Char and Int, are totally different things.

so, i think == is better, it treat as a special example, no any bad effect, instead, 'a' == "A" is true more naturally.

straight-shoota · July 20, 2022, 4:43pm

Char is a codepoint, thus a number. It can be represented as the character itself, or as the number of the codepoint. That’s similar to how there are different representations of numbers in different bases. 'a' is just the number 97, just like 0x61 as well. They all mean the same thing when interpreted as a character.

zw963 · July 21, 2022, 7:36am

Yes, Char is codepoint, but Char really has nothing to do with Integer, i consider Chat#===(Int) is not so useful, even, i consider it harmful, following code is more clear.

case 97.chr
when 'a'
  puts 'a'
when 'b'
  puts 'b'
end

or

case 'a'.ord
when 97
  puts "97"
when 98
  puts "98"
end

From the angle of Crystal user(not from ‘A’ internal store for), ‘A’ same as “A”, in fact, it save as binary form, when present as a codepoint, it use hexadecimal.

beta-ziliani · July 25, 2022, 7:39pm

I agree to everything @straight-shoota said.

I just want to add that I think c =~ /[0-9]/ should be a type error for c a Char.

Topic		Replies	Views
[Mini Review] Giving up on Crystal	14	11086	March 24, 2019
String.new with Pointer(Char) Crystal Contrib	5	762	May 19, 2022
Character classes in Regex Help & Support	6	864	October 8, 2021
String#ascii_only? Crystal Contrib	17	454	November 23, 2024
The core principles behind Crystal Community	34	3379	May 27, 2020

Fair critics about Crystal's Char type

Related topics