How do I convert an ASCII (or Unicode) character to an array of bits?
Oh, got it:
file_path = "/path/to/your/file"
bits = Array(Array(Int32)).new
File.read(file_path).each_char { |char|
  codepoint = char.ord # Char#ord is an Int32: 4 bytes, so 32 bits
  bits << (0..31).map { |i| codepoint.bit(i) }
}
puts bits
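One thing worth being explicit about: Crystal's Int#bit indexes from the least-significant bit, so each inner array comes out LSB-first. A minimal check with 'a' (codepoint 97, binary 1100001):

```crystal
codepoint = 'a'.ord # 97
bits = (0..31).map { |i| codepoint.bit(i) }
puts bits[0, 8].join # LSB-first: "10000110"
```

If you need MSB-first output, reverse each inner array.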
Depending on the exact output you want, you could also do something like:
bits = File.open FILE_PATH do |file|
  # Build the bits out as a string, otherwise you may run into Integer size limits pretty quickly.
  String.build do |io|
    file.each_byte do |byte|
      # precision: 8 zero-pads each byte so leading zero bits (and byte boundaries) aren't lost
      byte.to_s io, base: 2, precision: 8
    end
  end
end
pp bits
The main benefit is that it should be a bit more performant: you don’t need to load the file contents into memory all at once, and since you’re iterating on each byte you don’t need the char.ord call or the intermediary array.
But after benchmarking it, it seems yours is actually faster. Interesting…
String IO 122.82k ( 8.14µs) (± 1.22%) 8.88kB/op 1.39× slower
2D Array 170.39k ( 5.87µs) (± 0.95%) 2.71kB/op fastest
Maybe LLVM is able to do some optimizations here?
EDIT: Benchmark code is:
FILE_PATH = "./data.txt"
require "benchmark"
Benchmark.ips do |x|
  x.report "String IO" do
    File.open FILE_PATH do |file|
      # Build the bits out as a string, otherwise you may run into Integer size limits pretty quickly.
      String.build do |io|
        file.each_byte do |byte|
          byte.to_s io, base: 2
        end
      end
    end
  end

  x.report "2D Array" do
    bits = Array(Array(Int32)).new
    File.read(FILE_PATH).each_char do |char|
      codepoint = char.ord # Char#ord is an Int32: 4 bytes, so 32 bits
      bits << (0..31).map { |i| codepoint.bit(i) }
    end
    bits
  end
end
Where data.txt is just foo.
@Blacksmoke16 , Cool; thanks!
I got the 2D array slower on my side:
String IO 222.86k ( 4.49µs) (± 1.81%) 8.83kB/op fastest
2D Array 160.52k ( 6.23µs) (± 2.26%) 4.64kB/op 1.39× slower
The bigger the file is, the bigger the difference.
Also, I’d use BitArray instead of String or Array:
require "bit_array"

# File.size is in bytes; 8 bits per byte
bits_size = File.size(FILE_PATH) * 8
bit_array = BitArray.new bits_size.to_i
i = 0
File.open FILE_PATH, &.each_byte do |byte|
  # 8.times, not byte.bit_length.times: bit_length only counts up to the
  # highest set bit, so it would skip leading zero bits.
  8.times do |n|
    bit_array[i] = byte.bit(n) == 1
    i += 1
  end
end
p bit_array
I originally had byte.bit_length.times there and couldn’t figure out why bits_size of the file came out higher than i. Turns out bit_length ignores leading zero bits, so some bits were being skipped; with 8.times the two match.
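Going the other way, a BitArray built this way can be turned back into bytes (and then a String). A sketch, assuming the same LSB-first, 8-bits-per-byte layout as above; bits_to_bytes is just a name I made up:

```crystal
require "bit_array"

# Rebuild bytes from an LSB-first BitArray (8 bits per byte)
def bits_to_bytes(bit_array : BitArray) : Bytes
  bytes = Bytes.new(bit_array.size // 8) # zero-initialized
  bit_array.each_with_index do |bit, i|
    bytes[i // 8] |= (1_u8 << (i % 8)) if bit
  end
  bytes
end

ba = BitArray.new(8)
[0, 5, 6].each { |i| ba[i] = true } # 1 + 32 + 64 == 97 == 'a'
puts String.new(bits_to_bytes(ba)) # => a
```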
Each solution here produces a different result.
Those solutions may all be good and working. But when talking about character-to-bit conversion, the main question is: what kind of encoding are you looking for?
what kind of encoding are you looking for
A list of bits per character, e.g. “abd” would become something like:
[
  [1,0,1,0,...], # or whatever matches the first character
  ...
  [0,1,0,1,...], # or whatever matches the last character
]
… or preferably as floats, like [[1.0,0.0,...]].
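For that exact shape, the each_char approach above maps over directly. A sketch using LSB-first bit order; char_to_float_bits is just an illustrative name:

```crystal
# One Float64 per bit of the character's codepoint (Int32: 32 bits), LSB-first
def char_to_float_bits(char : Char) : Array(Float64)
  codepoint = char.ord
  (0..31).map { |i| codepoint.bit(i).to_f }
end

inputs = "abd".chars.map { |c| char_to_float_bits(c) }
p inputs.size     # => 3
p inputs[0][0, 3] # => [1.0, 0.0, 0.0]
```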
The above comments get me close enough for now, but I welcome optimizations.
For the sake of Unicode vs ASCII possibilities, converting back to characters, and a consistent data format, I think each_char is preferable over each_byte. However, for compactness (storage-wise), BitArray does sound good.
Another question is: what do you need this for?
That’s rather unspecific since there are lots of different possible results based on character encoding and byte format.
Another question is: what do you need this for?
This is for consuming text data and converting to RNN inputs/outputs in GitHub - drhuffman12/ai4cr: Artificial Intelligence for Crystal (based on https://github.com/SergioFierens/ai4r) .
I apologize if I’m not being specific enough. Based on how the .each_char method handles a sample text file I tried, it seemed like it would handle both ASCII and UTF-x, and it seemed like it would consistently give me 4 bytes of bits. But I probably should compare some other files and research how Crystal handles “character encoding and byte format”.
You don’t need to worry about how Crystal represents strings (that’s UTF-8 and thus byte order doesn’t matter). It just depends on what representation is required for your specific application.
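To illustrate the difference between the two representations discussed above: each_char / Char#ord yields Unicode codepoints, while each_byte yields the UTF-8 encoding, so they only agree for ASCII. For example with “é” (U+00E9):

```crystal
p "aé".chars.map(&.ord) # codepoints  => [97, 233]
p "aé".bytes            # UTF-8 bytes => [97, 195, 169]
```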
@straight-shoota , good to know; thanks!