I18n: Internationalization shard that handles interpolated strings, with compile-time checking

I have created a BrucePerens/i18n shard, which takes a different approach to internationalization from what I’ve seen in Crystal so far.

  • It handles interpolated strings. You don’t have to break them up into smaller strings to be translated; a macro does that for you.
  • It checks at compile time for the existence of a translation of a given string to all languages that have been set up. If a translation is missing, it raises an error with a helpful message.
  • It helps you generate translation tables, by emitting a string table at compile time, if you set a flag.

This is alpha-quality code so far. But given the new ideas, I thought it would be best to share it ASAP.

Thanks

Bruce

IME it’s important that a translator has the freedom to rearrange any interpolations. Most translation systems I’ve seen deal with this through some form of format-specifier-style expansion. So, for example, “Let’s go to %s” could get translated into “Lass uns nach %s gehen”. How does your approach deal with this in a way that is easy for the translator to understand? You describe that it handles interpolation but never show any examples of it :)
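For illustration, the format-specifier style mentioned here can be sketched in Crystal roughly as follows. The `GERMAN` table and the `translate_format` helper are hypothetical names made up for this example and do not belong to any shard; only `String#%` comes from the standard library.

```crystal
# Hypothetical translation table: the translator receives the whole
# sentence, including the %s placeholder, and may move the placeholder.
GERMAN = {
  "Let's go to %s" => "Lass uns nach %s gehen",
}

# Hypothetical helper: look up the whole template, then fill it in.
# Falls back to the source string when no translation exists.
def translate_format(source : String, value : String) : String
  GERMAN.fetch(source, source) % value
end

puts translate_format("Let's go to %s", "Berlin")
```

Because the placeholder travels with the sentence, the translator is free to put it wherever the target language needs it.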

I was about to comment exactly the same thing. Furthermore, interpolations might appear in different (swapped) places depending on the language.

Rearrangement of interpolations is generally necessary when there are two or more in the same string. Otherwise, an interpolation can occur in only three places: at the beginning before a fixed string, at the end after a fixed string, and between two fixed strings. These are relatively easy for the translator to deal with. We can similarly classify two interpolations as having at most three fixed strings separated by interpolations.

IMO dealing with this is not the highest-priority problem facing internationalization. The biggest problem is its cost to the programmer. Most internationalization makes your code ugly and takes energy and time to code. So, I wanted to handle the most general situation in the easiest way possible for the programmer, which is that you put t, one letter and a space, before each string, even interpolated strings.

Paul Smith (of Lucky fame) pointed out that it isn’t sufficient to simply send interpolated strings to the translator; you should have some sort of comment system so that you can tell the translator what the things in the interpolations are. You see, in many languages those things have different genders, even though they are inanimate, and the translator can’t produce a good translation without determining the gender of your model field. Did you know that, in the German language, Switzerland is a feminine noun while Germany is a neuter one? Seems crazy to an English speaker.

I am also concerned with the fact that we have a great interpolation feature in the language, and most people building translation facilities start by adding a second interpolation system that is different from the first. Maybe we need to think harder about how to use the first one.

So, I haven’t dealt with that at all yet. I will do so eventually, but firmly believe that some of those facilities will be used 1% of the time. It might be important for that 1%. I am working on more common cases first.

Here is the string table the program presently emits, so that you can get an idea of how it breaks up strings. Note that white space around an interpolation is not translated, as having translations that begin or end with white space would be error-prone.

"Korean" => "Korean", # /home/bruce/Crystal/UserCorps/usercorps/src/pages/edit/edit_language.cr:105
# Interpolated String "Language page #{id.to_i} was deleted." at /home/bruce/Crystal/UserCorps/usercorps/src/actions/edit/language/delete_language.cr:9
"Language page" => "Language page", # Interpolated at /home/bruce/Crystal/UserCorps/usercorps/src/actions/edit/language/delete_language.cr:9
"was deleted." => "was deleted.", # Interpolated at /home/bruce/Crystal/UserCorps/usercorps/src/actions/edit/language/delete_language.cr:9
# Interpolated String "Language page #{id.to_i}, for \"#{o.tag}\", could not be deleted: #{e.message}" at /home/bruce/Crystal/UserCorps/usercorps/src/actions/edit/language/delete_language.cr:11
"Language page" => "Language page", # Interpolated at /home/bruce/Crystal/UserCorps/usercorps/src/actions/edit/language/delete_language.cr:11
", for" => ", for", # Interpolated at /home/bruce/Crystal/UserCorps/usercorps/src/actions/edit/language/delete_language.cr:11
", could not be deleted:" => ", could not be deleted:", # Interpolated at /home/bruce/Crystal/UserCorps/usercorps/src/actions/edit/language/delete_language.cr:11
# Interpolated String "The requested Language page #{id.to_i} was not found." at /home/bruce/Crystal/UserCorps/usercorps/src/actions/edit/language/delete_language.cr:14
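As a rough sketch of the behaviour this table suggests (not the shard’s actual code), each fixed fragment could be looked up without its surrounding white space, which is then reattached unchanged. The `TABLE` constant, the `translate_fragment` helper, and the German entries are all assumptions made for this example.

```crystal
# Hypothetical table keyed by the trimmed fragments, as in the listing above.
TABLE = {
  "Language page" => "Sprachseite",
  "was deleted."  => "wurde gelöscht.",
}

# Translate one fixed fragment while leaving its leading and trailing
# white space untouched. Unknown fragments are returned unchanged.
def translate_fragment(fragment : String) : String
  core = fragment.strip
  return fragment if core.empty?
  leading = fragment[0, fragment.size - fragment.lstrip.size]
  trailing = fragment[fragment.rstrip.size..]
  leading + TABLE.fetch(core, core) + trailing
end

id = 7
puts translate_fragment("Language page ") + id.to_s + translate_fragment(" was deleted.")
```

The translator only ever sees the trimmed keys, so a translation cannot accidentally gain or lose the space next to an interpolation.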

Thanks

Bruce

When you split a sentence like "Language page #{id.to_i} was deleted." into "Language page", interpolated value, "was deleted.", it seems like any contextual binding between those fragments is lost. Is that correct?
I highly doubt that this could effectively work. Natural languages can have really complex grammars. Splitting a sentence into multiple pieces and translating them individually does not really seem to be a valid option.
Even if only 1% of strings are affected, correct language is such an important tool that a translation system must be able to handle 100% of cases.

For example, if we add an actor to the sentence, the English variant is "Language page #{id.to_i} was deleted by #{user.name}.". A German translation would be "Die Sprachseite #{id.to_i} wurde von #{user.name} gelöscht."
wurde ... gelöscht is the predicate. Those two words form an integral part of the sentence, but sit at separate locations with a different phrase in between. If they are split across multiple translation pieces, chances are high that things get messed up. The system would have to ensure that both pieces are always used together. In that case, there doesn’t seem to be a huge benefit in splitting up in the first place.

Note that this is still a relatively easy example. Some natural language grammars are not even context-free due to cross-serial dependencies, which leads to even more complex relations between individual translation pieces.

Hard-coding whitespace locations around interpolations in the original language could also lead to issues due to different whitespace and composition rules in other languages.

I did work out how a translator can re-order an interpolated string. Fortunately, we can continue to code in Crystal’s interpolated string syntax, instead of reinventing one. Each expression in the string is set up at compile time to put its result in an array of strings at run time. The translator can take a string like “A #{1+1} B #{2+2} C” and write a translation string like this: “a @2 b @1 c”, where “@2” would select the second expression in the original string. So, we don’t change how the string is coded at all, and invent the minimum necessary syntax for the translator.
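The run-time half of that idea might look something like the sketch below. It assumes the expression results have already been collected into an array of strings; the `reorder` name and the error handling are inventions for this example, not the shard’s API.

```crystal
# Replace each "@N" marker in the translator's string with the N-th
# (1-based) interpolated value from the original string.
def reorder(translation : String, values : Array(String)) : String
  translation.gsub(/@(\d+)/) do |_, match|
    n = match[1].to_i
    unless (1..values.size).includes?(n)
      raise ArgumentError.new("translation refers to @#{n}, but only #{values.size} expressions exist")
    end
    values[n - 1]
  end
end

# Results of the expressions in "A #{1+1} B #{2+2} C", in source order.
values = [(1 + 1).to_s, (2 + 2).to_s]
puts reorder("a @2 b @1 c", values)
```

With the values "2" and "4", this prints "a 4 b 2 c": the second expression now comes first, and the source code did not have to change.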

The remaining problem is how to annotate the string to give the translator some additional context, so that the translator can make noun gender decisions, etc. All I can think of is adding additional arguments which can be used for that.

Thanks

Bruce