Do ASCII/binary strings exist?

Hello everyone,

I'm working with a lot of data that, in Ruby, would live in strings with the ASCII or binary encoding. If I got it right, there is no direct equivalent in Crystal, so I'm using UInt8 slices, arrays, etc. instead. It works, but the dev part is not much fun (at least not compared to strings in Ruby). Is there a better way that I missed? I would love to have "\x01\x02foo\x00bar\x01" be usable in code, but also as output for debugging, and in general to get the look & feel (and handling) of strings, basically the same way as is possible in Ruby and some other languages.

You can use a String to hold bytes, there’s absolutely no problem with that.

Could you expand a bit on what you are working on, and which parts are not fun?

It's a lot of reading/writing of binary data, sometimes file related, but more often related to sockets.
I was afraid strings wouldn't be suitable, as the data is unlikely to be UTF-8 friendly. That's why I'm using IO::Memory and UInt8 slices and arrays instead.

This
"a\tlocalhost\x06\x02\x04\x00\x00\x80\x01D\xF8A3\xB7\xED\x9B\x98\x85\xEE\x80\xAA#x\x19\t\xCC\xEF\xA8i\xF5N6\xC3R\x81\x1EM\x05\xA8\xBF\xDE\x1E\xEB}\x8Bv<\xB0~\x01\x1C\xEF\x94\xD2\xE7\xD0\xE3\x9E\xEBZD\x1CC^.Yf\xC1B$\xCC.\xDE\x13K\xA9\x92\xD5k\xA8\xBC\xED$\x11\xCD\xA7R\xB1\xFB\xEA\x7F\x16\x9F\xF0\xB3h\t\xC3\x03\x95\xD0\x01\xC1)O\xA6V\xBE,ex\x04\xA3H\xFA\x1A\xF0\xF1$n\xA5\xA5\x1Ea\xC6%C8\x1A:\fUT@\xAB\xCD\x10\xE0\xD0\xBEeJ\x83\x19+\x8F\xE6,=\xEC\x92\xA0\xD5"`

would be so much nicer to read than
Bytes[97, 9, 108, 111, 99, 97, 108, 104, 111, 115, 116, 6, 2, 4, 0, 0, 128, 1, 68, 248, 65, 51, 183, 237, 155, 152, 133, 238, 128, 170, 35, 120, 25, 9, 204, 239, 168, 105, 245, 78, 54, 195, 82, 129, 30, 77, 5, 168, 191, 222, 30, 235, 125, 139, 118, 60, 176, 126, 1, 28, 239, 148, 210, 231, 208, 227, 96, 158, 235, 90, 68, 28, 67, 94, 46, 89, 102, 193, 66, 36, 204, 46, 222, 19, 75, 169, 146, 213, 107, 168, 188, 237, 36, 17, 205, 167, 82, 177, 251, 234, 127, 22, 159, 240, 179, 104, 9, 195, 3, 149, 208, 1, 193, 41, 79, 166, 86, 190, 44, 101, 120, 4, 163, 72, 250, 26, 240, 241, 36, 110, 165, 165, 30, 97, 198, 37, 67, 56, 26, 58, 12, 85, 84, 64, 171, 205, 16, 224, 208, 190, 101, 74, 131, 25, 43, 143, 230, 44, 61, 236, 146, 160, 213]

While it might not look much different in terms of readability at first glance, it usually really is, either because of actual ASCII strings (like the "localhost" in my example) or when comparing it with hex editors, output from other software, etc. The longer the data fragments are, the more often there is some recognisable ASCII somewhere in them, which I can use as reference points during debugging (currently I spend a lot of time counting bytes, or sit there with two fingers on my screen, slowly comparing byte after byte of my actual output with the data I try to get). This would still happen in Ruby, for example, but much less, as I could skip huge parts.

It's also so much easier and faster to recognise when data is in the wrong format (for example when encoding and encryption steps are involved and some step has been missed or done in the wrong format, etc.). Here, for example, telling Bytes[182, 96, 93, 233, 226, 42, 74, 124, 203, 237, 137, 245, 209, 136, 127, 165] apart from Bytes[98, 54, 54, 48, 53, 100, 101, 57, 101, 50, 50, 97, 52, 97, 55, 99, 99, 98, 101, 100, 56, 57, 102, 53, 100, 49, 56, 56, 55, 102, 97, 53] is much harder, but the difference becomes immediately obvious as "\xB6`]\xE9\xE2*J|\xCB\xED\x89\xF5ш\u007F\xA5" versus "b6605de9e22a4a7ccbed89f5d1887fa5" (this could be hashed data, for example: once as binary, once as hex output, like the HMAC method of OpenSSL which offers both).

It becomes even more inconvenient when I want to use some string in code (fixed text, e.g. ASCII-readable file headers). Currently I convert those in Ruby and copy/paste the resulting array of UInt8 values into my Crystal source, because I also don't want to convert from string to bytes at runtime (but it would be nice to be able to just use the ASCII representation directly in my code).
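
For illustration, a minimal sketch of what the answers below suggest: a Crystal string literal can keep the ASCII-readable header in the source, and to_slice exposes its bytes without copying (the HEADER constant is just an example name):

# The literal stays readable in the source; String#to_slice returns a
# read-only view of the literal's bytes, so the data is not converted or copied.
HEADER = "BM\x00\x01".to_slice
HEADER # => Bytes[66, 77, 0, 1]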

If the problem is being able to see the data in a nice way, you can call hexstring on those bytes and it will print exactly that. You could even override to_s(io) if you wanted, to show things as hexstrings.
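
For example (a minimal sketch; the HexBytes wrapper is just an illustration, not part of the standard library):

data = Bytes[97, 9, 108, 111, 99, 97, 108, 104, 111, 115, 116]
data.hexstring # => "61096c6f63616c686f7374"

# Hypothetical wrapper type that always prints its bytes as a hexstring:
record HexBytes, bytes : Bytes do
  def to_s(io : IO) : Nil
    io << bytes.hexstring
  end
end

puts HexBytes.new(data) # prints 61096c6f63616c686f7374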

We can’t change the string representation by default because then you wouldn’t know, at a glance, if what you are seeing is a string or a Bytes. The ideal scenario would be to show it like b"..." and have b"..." be a literal of bytes. I’d really like that to happen, but I don’t know if it’s ever going to happen.

You can also call hexdump and it will show things in a way that will be much nicer to debug, even nicer than that “\x” thing.

For the first bytes you mentioned, this is the hexdump:

00000000  61 09 6c 6f 63 61 6c 68  6f 73 74 06 02 04 00 00  a.localhost.....
00000010  80 01 44 f8 41 33 b7 ed  9b 98 85 ee 80 aa 23 78  ..D.A3........#x
00000020  19 09 cc ef a8 69 f5 4e  36 c3 52 81 1e 4d 05 a8  .....i.N6.R..M..
00000030  bf de 1e eb 7d 8b 76 3c  b0 7e 01 1c ef 94 d2 e7  ....}.v<.~......
00000040  d0 e3 60 9e eb 5a 44 1c  43 5e 2e 59 66 c1 42 24  ..`..ZD.C^.Yf.B$
00000050  cc 2e de 13 4b a9 92 d5  6b a8 bc ed 24 11 cd a7  ....K...k...$...
00000060  52 b1 fb ea 7f 16 9f f0  b3 68 09 c3 03 95 d0 01  R........h......
00000070  c1 29 4f a6 56 be 2c 65  78 04 a3 48 fa 1a f0 f1  .)O.V.,ex..H....
00000080  24 6e a5 a5 1e 61 c6 25  43 38 1a 3a 0c 55 54 40  $n...a.%C8.:.UT@
00000090  ab cd 10 e0 d0 be 65 4a  83 19 2b 8f e6 2c 3d ec  ......eJ..+..,=.
000000a0  92 a0 d5                                          ...
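
(For reference, a minimal sketch of how such a dump can be produced; the output above comes from the full byte array, truncated here for brevity:)

bytes = Bytes[97, 9, 108, 111, 99, 97, 108, 104, 111, 115, 116, 6, 2, 4, 0, 0, 128, 1]
puts bytes.hexdump # prints the offset / hex / ASCII columns shown above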

The tools to debug these things are in the standard library, they just aren’t the default ones used by to_s or inspect.

Ah, no, hexstring is not what I thought…

But if you convert the bytes to a String and call inspect (or use top-level p) then it works:

bytes = Bytes[97, 9, 108, 111, 99, 97, 108, 104, 111, 115, 116, 6, 2, 4, 0, 0, 128, 1, 68, 248, 65, 51, 183, 237, 155, 152, 133, 238, 128, 170, 35, 120, 25, 9, 204, 239, 168, 105, 245, 78, 54, 195, 82, 129, 30, 77, 5, 168, 191, 222, 30, 235, 125, 139, 118, 60, 176, 126, 1, 28, 239, 148, 210, 231, 208, 227, 96, 158, 235, 90, 68, 28, 67, 94, 46, 89, 102, 193, 66, 36, 204, 46, 222, 19, 75, 169, 146, 213, 107, 168, 188, 237, 36, 17, 205, 167, 82, 177, 251, 234, 127, 22, 159, 240, 179, 104, 9, 195, 3, 149, 208, 1, 193, 41, 79, 166, 86, 190, 44, 101, 120, 4, 163, 72, 250, 26, 240, 241, 36, 110, 165, 165, 30, 97, 198, 37, 67, 56, 26, 58, 12, 85, 84, 64, 171, 205, 16, 224, 208, 190, 101, 74, 131, 25, 43, 143, 230, 44, 61, 236, 146, 160, 213]
p String.new(bytes)

Output:

"a\tlocalhost\u0006\u0002\u0004\u0000\u0000\x80\u0001D\xF8A3\xB7훘\x85\uE02A#x\u0019\t\xCC\xEF\xA8i\xF5N6\xC3R\x81\u001EM\u0005\xA8\xBF\xDE\u001E\xEB}\x8Bv<\xB0~\u0001\u001C\xEF\x94\xD2\xE7\xD0\xE3`\x9E\xEBZD\u001CC^.Yf\xC1B$\xCC.\xDE\u0013K\xA9\x92\xD5k\xA8\xBC\xED$\u0011ͧR\xB1\xFB\xEA\u007F\u0016\x9F\xF0\xB3h\t\xC3\u0003\x95\xD0\u0001\xC1)O\xA6V\xBE,ex\u0004\xA3H\xFA\u001A\xF0\xF1$n\xA5\xA5\u001Ea\xC6%C8\u001A:\fUT@\xAB\xCD\u0010\xE0оeJ\x83\u0019+\x8F\xE6,=쒠\xD5"

I was a bit shocked when you suggested String anyway (not as in "how dare you?" but as in "did I maybe go through the hassle of working with bytes for no reason?"). So I gave it a try, and it all seemed to work fine… until it didn't.

I don’t remember if there were other issues as well, but at least this can be a problem:

a = Bytes[15, 221, 23, 105, 240, 159, 152, 128, 35, 33, 83, 125, 108, 197, 146, 22, 54, 116]
b = String.new(a)
p a[10..11]
p String.new(a[10..11])
p b[10..11]

I would expect the 2nd and 3rd outputs to be exactly the same, and the 1st output to be its byte representation. But that's not the case, because within the data there were bytes which got interpreted as multi-byte Unicode. The second and the third output don't refer to the same segment (in this case a different offset, but in other cases they could also differ in length). Just for clarification: this would happen in Ruby as well, unless I set the encoding of the string to binary or ASCII (which I think isn't possible in Crystal).

Right, that’s not the case. A string assumes its bytes to be UTF-8 encoded, and indexing a string will use codepoints for indexes. If you want to use bytes you can use Slice(UInt8) or Bytes (the alias) for that. That’s the main difference.

Nothing prevents you from using a String as bytes, though. You can call to_slice on it and then do this:

a = Bytes[15, 221, 23, 105, 240, 159, 152, 128, 35, 33, 83, 125, 108, 197, 146, 22, 54, 116]
b = String.new(a)
p a[10..11]
p String.new(b.to_slice[10..11])
p b.to_slice[10..11]

and the second and third lines will now refer to the same two bytes, because going through to_slice goes from the string back to its bytes.

The main question is: why do you need to go from bytes to string to bytes and so on? If you only need to see the bytes representation as a string, you can do String.new(bytes).inspect, for debugging purposes, but then always work with bytes, never with a string.

String#[] operates on characters, which is not identical to the byte representation. a[10..11] refers to the 10th and 11th characters; since a UTF-8 character takes 1 to 4 bytes, those two characters can start anywhere from byte offset 10 to 40 and span between 2 and 8 bytes.

You can use #to_slice as @asterite suggested. Or #byte_slice which returns a substring at the given byte positions.

a = Bytes[15, 221, 23, 105, 240, 159, 152, 128, 35, 33, 83, 125, 108, 197, 146, 22, 54, 116]
String.new(a[10..11])           # => "S}"
String.new(a).byte_slice(10, 2) # => "S}"

(String#byte_slice doesn’t accept a range parameter. I’m sure this could trivially be added.)
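
A minimal sketch of such an addition as a local monkey patch (the Range overload is an assumption, not something in the standard library):

class String
  # Hypothetical convenience overload: byte_slice with a Range,
  # implemented on top of the existing byte_slice(start, count).
  def byte_slice(range : Range(Int32, Int32)) : String
    start = range.begin
    count = range.end - start + (range.excludes_end? ? 0 : 1)
    byte_slice(start, count)
  end
end

a = Bytes[15, 221, 23, 105, 240, 159, 152, 128, 35, 33, 83, 125, 108, 197, 146, 22, 54, 116]
String.new(a).byte_slice(10..11) # => "S}"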

Hmmm :thinking: I guess by now it would really be mostly just for debugging purposes. Somehow I can't see myself using String.new(bytes).inspect in the future though (maybe laziness, maybe the fear that I would occasionally mess something up; it's for sure easier to just add a p together with a pair of parentheses), at least not directly (maybe I might just define a modified version of p!?).
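
Such a modified p could be as small as this (a sketch; the name pb and its behaviour are made up):

# Print a Bytes value as an escaped string, but return the slice
# unchanged so it can be wrapped around existing expressions like p.
def pb(bytes : Bytes) : Bytes
  puts String.new(bytes).inspect
  bytes
end

pb Bytes[1, 2, 102, 111, 111, 0, 98, 97, 114] # prints "\u0001\u0002foo\u0000bar"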

But to be honest, I guess the nature of my question was more about whether there is a better out-of-the-box solution which I might have missed than about finding a new solution per se (maybe compare it to my question about where one would go to look for shards - I'm so glad I asked, because I definitely learned a lot from the answers - even though I saw this morning that the same links had already been on the Crystal page and I must have simply missed them).

But I'm also thinking about just heavily monkey-patching String directly for future projects. I have basically no need for Unicode support at all (there might sometimes be Unicode included, but whenever I need the size or position of something, it'll be related to the actual bytes), so there isn't anything I would lose, and the "stains" of monkey patching wouldn't matter either, as String is anyway one of the few remaining classes still left untouched. In my head I'm still seeing the data as Strings (old habits die hard) and I just wouldn't need to think about it anymore. I'll see.

Thanks a lot though!

I guess I had really hoped (if even 'hoped' - I guess I really was just wondering whether) there is a way to use String the same way as I've been using it in some other languages for quite a while now (just for clarification: there is absolutely no disappointment or even complaint in this, just some surprise).

Thanks :)

Would you mind telling us what those other languages are? I think most of them make a clear distinction between bytes and strings. The only exception would be Ruby, so maybe it was Ruby? :slight_smile:

I don't follow either. What "same way" would that be? You can use String pretty much as a sequence of arbitrary bytes. Methods like #byte_slice, #byte_index, #byte_at etc. provide an API for using it with byte indices if you don't care about Unicode codepoints and the resulting character indices.
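
For example (a small sketch of those byte-oriented methods):

s = String.new(Bytes[0x01, 0x02, 0x66, 0x6F, 0x6F, 0x00, 0x62, 0x61, 0x72]) # "\u0001\u0002foo\u0000bar"
s.bytesize         # => 9
s.byte_at(2)       # => 102 ('f')
s.byte_index(0x00) # => 5
s.byte_slice(2, 3) # => "foo"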

I don't know which of them, if any, had it selectable the way Ruby does, but I definitely used these in such a byte-y way: Python and Elixir (probably less surprising, as interpreted languages are in general quite "forgiving"), but I was also able to do so in Delphi, VB and Clarion (they are all outdated as heck; maybe it's related to their age, maybe it isn't; the last couple of years I have dealt almost exclusively with interpreted languages). Swift is new (and hyper confusing to me) but also has ASCII strings (kinda comparable with Ruby - you choose the encoding, with UTF-8 and ASCII being possible options). And I'm almost certain C++ also worked in such a byte-y way (I might be wrong though, it's been too long ago).

Same way as in: a string is basically just a fancy container for bytes (which shows the character representation if one is available, otherwise \xAB). Although I'm unsure right now whether this might actually be possible here after all (with your examples, especially byte_slice, I can't think of anything that would still be missing). I guess I'll have to give Strings another go.

Maybe I just got too convinced by the docs saying Strings would automatically be UTF-8 (or maybe I was reading too much into that information).

I’ll give it another go and will report back (either with remaining differences if I find any or agreeing that it’s also fit for binary purposes). When I tried it the first time I quickly started to struggle badly (which doesn’t have to mean much as it was my very first day with Crystal). We’ll see

It seems String might indeed be fine to use with binary content. All necessary methods seem to be available. Although: it's really easy to accidentally pick the wrong method (e.g. size instead of bytesize), and such a mistake can stay unnoticed for a long time (and once the bug is noticed, it can take quite a while to figure out its nature), because the wrong code reads too well, is accepted by the compiler just fine, and will even work fine for the majority of random test data (=> easy to happen, but potentially really, really hard to debug). Picking the right methods is likely just a matter of practice and time. I also had to look up the exact method names quite often as they sometimes felt less intuitive (is it byte_size or bytesize? And is it bytes, to_bytes, byteslice, byte_slice, to_byte_slice?), but this too should become irrelevant once one gets used to it.
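
To illustrate the size-versus-bytesize trap (a minimal sketch):

s = String.new(Bytes[0xC3, 0x90, 0x41]) # these bytes happen to decode as "ÐA"
s.size     # => 2 (characters)
s.bytesize # => 3 (bytes, which is usually what binary code needs)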

My personal conclusion: String feels(!) unnatural to me, more of a hasty hack than a first class citizen. I will probably just stick to Bytes, maybe with a new type alias, and some added sprinkles for p, etc.

But String seems to work for binary content in general. I think the docs could maybe reflect this a bit better (when I had issues using String, I checked the docs, and the intro description gave me the idea, and seemed to confirm my impression, that what I was trying to do might simply be too far off the class's intended purpose; so I left String behind before I could figure out that I actually could have just used it). Maybe adjust it a bit, to something along the lines of "full Unicode support, while ASCII/binary use remains possible" - it isn't obvious, especially if one has already got the wrong idea.

Ruby's String is mutable until frozen, and has #getbyte and #setbyte; in fact, an ASCII-8BIT string is probably the closest thing you can get to a mutable byte array in vanilla Ruby. In contrast, Crystal's is immutable, and trying to get around this limitation is undefined behaviour:

str = String.build { |io| io << "01234567" }
str.size # => 8
bytes = Slice.new(str.to_unsafe, str.bytesize)
bytes[3] = 0xC3
bytes[4] = 0x90
str      # => "012Ð567"
str.size # => 8

That’s another reason one should work with Bytes instead of String in Crystal unless necessary for debugging purposes.

To add a bit more: a string lets you work with characters; Bytes lets you work with bytes. A string's characters are encoded and interpreted using UTF-8, and the underlying byte representation is exposed in some methods, like byte_slice. But if you want to work most of the time with characters, you use a String. If you need to work with bytes, and potentially in a mutable buffer, you use Bytes.

Java has String like this, and then you can use byte[]. In C# it’s similar. Also in Rust.

Maybe older languages don’t make this distinction, but I’d say most modern languages do.

I also want to clarify that String is not a hack. There's been a lot of thought put into it. It didn't grow organically (like hacks do).

The "hack" wasn't meant as an insult, nor aimed at String in general. Just the byte handling felt a bit that way to me (but again: no offence meant). The "normal" String handling follows that of other objects, but accessing the bytes feels a bit off (the method names, etc.). But this is likely just my personal view.
