How do I convert an ASCII (or Unicode) character to an array of bits?
Oh, got it:
file_path = "/path/to/your/file"
bits = Array(Array(Int32)).new
File.read(file_path).each_char { |char|
  codepoint = char.ord # Char#ord is an Int32: 4 bytes, so 32 bits
  bits << (0..31).map { |i| codepoint.bit(i) }
}
puts bits
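One thing worth being explicit about: Crystal's Int#bit indexes from the least-significant bit, so each inner array comes out LSB-first. A minimal check with 'a' (codepoint 97, binary 1100001):

```crystal
codepoint = 'a'.ord # 97
bits = (0..31).map { |i| codepoint.bit(i) }
puts bits[0, 8].join # LSB-first: "10000110"
```

If you need MSB-first output, reverse each inner array.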
Depending on the exact output you want, you could also do something like:
bits = File.open FILE_PATH do |file|
  # Build the bits out as a string, otherwise you may run into Integer size limits pretty quickly.
  String.build do |io|
    file.each_byte do |byte|
      # precision: 8 zero-pads each byte so leading zero bits (and byte boundaries) aren't lost
      byte.to_s io, base: 2, precision: 8
    end
  end
end
pp bits
The main benefit is that it should be a bit more performant: you don’t need to load the file contents into memory all at once, and since you’re iterating on each byte you don’t need the char.ord call or the intermediary array.
But after benchmarking it, it seems yours is actually faster. Interesting…
String IO 122.82k ( 8.14µs) (± 1.22%) 8.88kB/op 1.39× slower
2D Array 170.39k ( 5.87µs) (± 0.95%) 2.71kB/op fastest
Maybe LLVM is able to do some optimizations here?
EDIT: Benchmark code is:
FILE_PATH = "./data.txt"
require "benchmark"
Benchmark.ips do |x|
  x.report "String IO" do
    File.open FILE_PATH do |file|
      # Build the bits out as a string, otherwise you may run into Integer size limits pretty quickly.
      String.build do |io|
        file.each_byte do |byte|
          byte.to_s io, base: 2
        end
      end
    end
  end

  x.report "2D Array" do
    bits = Array(Array(Int32)).new
    File.read(FILE_PATH).each_char do |char|
      codepoint = char.ord # Char#ord is an Int32: 4 bytes, so 32 bits
      bits << (0..31).map { |i| codepoint.bit(i) }
    end
    bits
  end
end
Where data.txt is just foo.
@Blacksmoke16 , Cool; thanks!
I got the 2D array slower on my side:
String IO 222.86k ( 4.49µs) (± 1.81%) 8.83kB/op fastest
2D Array 160.52k ( 6.23µs) (± 2.26%) 4.64kB/op 1.39× slower
The bigger the file is, the bigger the difference.
Also, I’d use BitArray instead of String or Array:
require "bit_array"

# File.size is in bytes; 8 bits per byte
bits_size = File.size(FILE_PATH) * 8
bit_array = BitArray.new bits_size.to_i
i = 0
File.open FILE_PATH, &.each_byte do |byte|
  # 8.times, not byte.bit_length.times: bit_length only counts up to the
  # highest set bit, so it would skip leading zero bits.
  8.times do |n|
    bit_array[i] = byte.bit(n) == 1
    i += 1
  end
end
p bit_array
I originally had byte.bit_length.times there and couldn’t figure out why bits_size of the file came out higher than i. Turns out bit_length ignores leading zero bits, so some bits were being skipped; with 8.times the two match.
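Going the other way, a BitArray built this way can be turned back into bytes (and then a String). A sketch, assuming the same LSB-first, 8-bits-per-byte layout as above; bits_to_bytes is just a name I made up:

```crystal
require "bit_array"

# Rebuild bytes from an LSB-first BitArray (8 bits per byte)
def bits_to_bytes(bit_array : BitArray) : Bytes
  bytes = Bytes.new(bit_array.size // 8) # zero-initialized
  bit_array.each_with_index do |bit, i|
    bytes[i // 8] |= (1_u8 << (i % 8)) if bit
  end
  bytes
end

ba = BitArray.new(8)
[0, 5, 6].each { |i| ba[i] = true } # 1 + 32 + 64 == 97 == 'a'
puts String.new(bits_to_bytes(ba)) # => a
```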
Each solution here produces a different result.
Those solutions may all be good and working. But when talking about character-to-bit conversion, the main question is: what kind of encoding are you looking for?
what kind of encoding are you looking for
A list of bits per character, e.g. “abd” would become something like:
[
  [1,0,1,0,...], # or whatever matches the first character
  ...
  [0,1,0,1,...], # or whatever matches the last character
]
… or preferably as floats, like [[1.0,0.0,...]].
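For that exact shape, the each_char approach above maps over directly. A sketch using LSB-first bit order; char_to_float_bits is just an illustrative name:

```crystal
# One Float64 per bit of the character's codepoint (Int32: 32 bits), LSB-first
def char_to_float_bits(char : Char) : Array(Float64)
  codepoint = char.ord
  (0..31).map { |i| codepoint.bit(i).to_f }
end

inputs = "abd".chars.map { |c| char_to_float_bits(c) }
p inputs.size     # => 3
p inputs[0][0, 3] # => [1.0, 0.0, 0.0]
```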
The above comments get me close enough for now, but I welcome optimizations.
For the sake of Unicode vs ASCII possibilities, converting back to characters, and a consistent data format, I think each_char is preferable over each_byte. However, for compactness (storage-wise), BitArray does sound good.
Another question is: what do you need this for?
That’s rather unspecific since there are lots of different possible results based on character encoding and byte format.
Another question is: what do you need this for?
This is for consuming text data and converting to RNN inputs/outputs in GitHub - drhuffman12/ai4cr: Artificial Intelligence for Crystal (based on https://github.com/SergioFierens/ai4r) .
I apologize if I’m not being specific enough. Based on how the .each_char method handles a sample text file I tried, it seemed like it would handle both ASCII and UTF-x, and it seemed like it would consistently give me 4 bytes of bits. But I probably should compare some other files and research how Crystal handles “character encoding and byte format”.
You don’t need to worry about how Crystal represents strings (that’s UTF-8 and thus byte order doesn’t matter). It just depends on what representation is required for your specific application.
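To illustrate the difference between the two representations discussed above: each_char / Char#ord yields Unicode codepoints, while each_byte yields the UTF-8 encoding, so they only agree for ASCII. For example with “é” (U+00E9):

```crystal
p "aé".chars.map(&.ord) # codepoints  => [97, 233]
p "aé".bytes            # UTF-8 bytes => [97, 195, 169]
```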
@straight-shoota , good to know; thanks!