Convert a character to bits?

How do I convert an ASCII (or Unicode) character to an array of bits?

Oh, got it. :slight_smile:

file_path = "/path/to/your/file"
bits = Array(Array(Int32)).new
File.read(file_path).each_char { |char|
  code_point = char.ord # Int32 code point: 4 bytes, so 32 bits
  # Collect the 32 bits, least significant bit first
  bits << (0..31).map { |i| code_point.bit(i) }
}
puts bits

Depending on the exact output you want, you could also do something like:

bits = File.open FILE_PATH do |file|
  # Build the bits out as a string; otherwise we may run into Integer size limits pretty quickly.
  String.build do |io|
    file.each_byte do |byte|
      byte.to_s io, base: 2
    end
  end
end

pp bits

The main benefit is that it should be a bit more performant, given that you don’t need to load the whole file contents into memory, and since you are iterating over each byte you don’t need the char.ord call or the intermediate array.

But after benchmarking it, yours actually seems to be faster. Interesting…

String IO 122.82k (  8.14µs) (± 1.22%)  8.88kB/op   1.39× slower
 2D Array 170.39k (  5.87µs) (± 0.95%)  2.71kB/op        fastest

Maybe LLVM is able to do some optimizations here? :man_shrugging:

EDIT: Benchmark code is:

FILE_PATH = "./data.txt"

require "benchmark"

Benchmark.ips do |x|
  x.report "String IO" do
    File.open FILE_PATH do |file|
      # Build the bits out as a string; otherwise we may run into Integer size limits pretty quickly.
      String.build do |io|
        file.each_byte do |byte|
          byte.to_s io, base: 2
        end
      end
    end
  end

  x.report "2D Array" do
    bits = Array(Array(Int32)).new
    File.read(FILE_PATH).each_char do |char|
      bytes = char.ord # 4 Bytes, so 32 bits
      bits << (0..31).to_a.map { |i| bytes.bit(i) }
      bits
    end
    bits
  end
end

Where data.txt contains just the text foo.

@Blacksmoke16 , Cool; thanks! :slight_smile:

On my side, the 2D array version is the slower one:

String IO 222.86k (  4.49µs) (± 1.81%)  8.83kB/op        fastest
 2D Array 160.52k (  6.23µs) (± 2.26%)  4.64kB/op   1.39× slower

The bigger the file, the bigger the difference.


Also, I’d use BitArray instead of String or Array:

require "bit_array"

# Total size in bits (file size in bytes * 8)
bits_size = File.size(FILE_PATH) * 8
bit_array = BitArray.new bits_size.to_i

i = 0
File.open FILE_PATH, &.each_byte do |byte|
  byte.bit_length.times do |n|
    bit_array[i] = byte.bit(n) == 1
    i += 1
  end
end

p bit_array[0..i]

I’m not sure it is correct, though. I don’t know why bits_size is higher than i; maybe because of metadata?
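
A likely explanation for the mismatch, offered as a sketch rather than a confirmed diagnosis: `bit_length` is the minimum number of bits needed to represent a value, so bytes with leading zero bits (all ASCII letters, for instance) contribute fewer than 8 bits each. File metadata is not involved, since `File.size` only counts content bytes. The helper name `bytes_to_bit_array` below is hypothetical; it always writes 8 bits per byte so the totals match.

```crystal
require "bit_array"

# Hypothetical helper (not from the thread): pack bytes into a BitArray,
# always writing 8 bits per byte, least significant bit first.
def bytes_to_bit_array(bytes : Bytes) : BitArray
  bit_array = BitArray.new(bytes.size * 8)
  bytes.each_with_index do |byte, byte_index|
    8.times do |n|
      bit_array[byte_index * 8 + n] = byte.bit(n) == 1
    end
  end
  bit_array
end

data = "foo".to_slice # same content as the data.txt used in the benchmark

# 'f' (102) and 'o' (111) each need only 7 bits, so bit_length.times
# advances the index by 21 instead of 24.
p data.map(&.bit_length).to_a   # => [7, 7, 7]
p bytes_to_bit_array(data).size # => 24
```

With a fixed width of 8 bits per byte, the final index equals the file size in bytes times 8, and no slicing of the result is needed.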

Each solution here produces a different result :confused:
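
To illustrate why the results differ, here is a small comparison for the single character 'f'. It is a sketch under the assumption that the differences come from bit order, fixed width, and zero padding; the variable names are made up for this example.

```crystal
char = 'f' # code point 102, binary 1100110

# Approach 1 (char.ord + bit): 32 bits of the code point,
# least significant bit first.
from_ord = (0..31).map { |i| char.ord.bit(i) }

# Approach 2 (byte.to_s with base 2): digits of each byte,
# most significant bit first, leading zeros dropped.
from_to_s = String.build do |io|
  char.to_s.each_byte { |byte| byte.to_s(io, base: 2) }
end

# Variant of approach 2, left-padded to 8 digits per byte so that
# byte boundaries stay recoverable.
padded = String.build do |io|
  char.to_s.each_byte { |byte| io << byte.to_s(2).rjust(8, '0') }
end

p from_ord.first(8) # => [0, 1, 1, 0, 0, 1, 1, 0]
p from_to_s         # => "1100110"
p padded            # => "01100110"
```

The BitArray version follows the bit order of approach 1 but, like the unpadded string, emits only `bit_length` bits per byte.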

Those solutions may all be good and working. But when talking about character-to-bit conversion, the main question is: what kind of encoding are you looking for?

what kind of encoding are you looking for

A list of bits per character. e.g.: “abd” would become something like

[
  [1,0,1,0,...], # or whatever matches the first character
  ...
  [0,1,0,1,...], # or whatever matches the last character
]

… or preferably as floats like [[1.0,0.0,...]].
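
One way to get that shape is sketched below. It assumes one row per character, built from the character's code point with the least significant bit first; the helper name `char_bits_as_floats` and its `width` parameter are hypothetical, not part of any library.

```crystal
# Hypothetical helper: one row of `width` floats (0.0 or 1.0) per character,
# taken from the character's code point, least significant bit first.
def char_bits_as_floats(text : String, width : Int32 = 32) : Array(Array(Float64))
  text.chars.map do |char|
    code_point = char.ord
    Array(Float64).new(width) { |i| code_point.bit(i).to_f }
  end
end

rows = char_bits_as_floats("abd")
p rows.size           # => 3
p rows[0].size        # => 32
p rows[0].first(8)    # 'a' is 97 => [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
```

Since the highest Unicode code point is U+10FFFF, a width of 21 would be enough to cover every character if smaller input vectors are preferred.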

The above comments get me close enough for now, but I welcome optimizations.

For the sake of handling both Unicode and ASCII, converting back to characters, and keeping a consistent data format, I think each_char is preferable to each_byte. However, for compactness (storage-wise), BitArray does sound good.
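
Converting back to characters can be sketched as follows, assuming the bits of each character are stored as 32 integers with the least significant bit first (as in the each_char approach above). The helper name `bits_to_char` is hypothetical.

```crystal
# Hypothetical helper: rebuild a Char from its code point bits,
# least significant bit first.
def bits_to_char(bits : Array(Int32)) : Char
  code_point = 0
  bits.each_with_index do |bit, i|
    code_point |= bit << i
  end
  code_point.chr
end

text = "aé€"
encoded = text.chars.map { |char| (0..31).map { |i| char.ord.bit(i) } }
decoded = encoded.map { |bits| bits_to_char(bits) }.join

p decoded == text # => true
```

Because every row has the same width regardless of how many UTF-8 bytes the character occupies, the round trip works the same way for ASCII and non-ASCII text.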

Another question is: what do you need this for?

That’s rather unspecific since there are lots of different possible results based on character encoding and byte format.

@asterite ,

Another question is: what do you need this for?

This is for consuming text data and converting it to RNN inputs/outputs in ai4cr (GitHub: drhuffman12/ai4cr, Artificial Intelligence for Crystal, based on https://github.com/SergioFierens/ai4r).


@straight-shoota ,

I apologize if I’m not being specific enough. Based on how the .each_char method handled a sample text file I tried, it seemed like it would handle both ASCII and UTF-x and would consistently give me 4 bytes of bits. But I should probably compare some other files and research how Crystal handles “character encoding and byte format”.

You don’t need to worry about how Crystal represents strings (that’s UTF-8 and thus byte order doesn’t matter). It just depends on what representation is required for your specific application.
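
As a small illustration of the difference between the two views of a string (a sketch with an arbitrary sample text): each_char yields whole code points, while each_byte yields the individual UTF-8 bytes, so a non-ASCII character shows up as several bytes.

```crystal
text = "aé€"

p text.size     # number of characters  => 3
p text.bytesize # number of UTF-8 bytes => 6

# Code points, one per character
p text.chars.map(&.ord) # => [97, 233, 8364]

# UTF-8 bytes: 1 for 'a', 2 for 'é', 3 for '€'
p text.bytes # => [97, 195, 169, 226, 130, 172]
```

Which of the two is the right input depends on the application: code points give one fixed-width row per character, bytes give a smaller alphabet but a variable number of rows per character.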

@straight-shoota , good to know; thanks! :slight_smile: