Frequency distribution of words in text : Is this code idiomatic?

Is this code idiomatic, is it good practice ?

dico = Hash(String, Int32).new

File.each_line("../100-0.txt") do |line|
  a = line.split
  a.each do |word|
    word = word.strip(";,.'?!\n")
    word = word.downcase
    if dico[word]? != nil
      dico[word] += 1
    else
      dico[word] = 0
    end
  end
end

a = dico.to_a
a.sort_by! { |key_val| key_val[1] }
a.reverse!
a[0, 10].each do |item|
  puts item
end

There aren’t really any guides on what “idiomatic” means in Crystal, so I’m just going off of what I do and what I’ve seen in the community (which I guess is what “idiomatic” really ought to mean anyway). In order to provide runnable code (for me and for you and for any future readers), I’m gonna use an IO::Memory to simulate your file, and I’m gonna do a few different passes/versions of the code to show some idiomatic stuff and some really useful methods.

First, let’s keep the general structure of your code but change some lines to make them more idiomatic (and fix a bug!):

dico = Hash(String, Int32).new

sample_text = <<-TEXT
  something something some thing
  some thing something arduous
  something some some thing
  TEXT

# we're pretending this IO is a File
io = IO::Memory.new sample_text

io.each_line do |line|
  a = line.split
  a.each do |word|
    word = word.strip(";,.'?!\n").downcase # we can do both of these operations in one line
    if dico[word]? # don't need the != nil; nil is falsey
      dico[word] += 1
    else
      dico[word] = 1 # gotta change this from 0 to 1
    end
  end
end

a = dico.to_a.sort_by { |key_val| key_val[1] }.reverse # we can do all of this in one line without mutating methods
a.each do |item| # we're deleting the [0,10] because it's not relevant to our example
  puts item
end

The if dico[word]? line is a pretty common pattern in Crystal, since nil is already falsey. However, the next iteration shows a more convenient way to do what you’re doing with a Hash.

Putting stuff in one line is totally a preference thing. In fact, moving the key-sorting to one line and using non-mutating methods will technically cause unnecessary memory allocation. In most cases, I think that’s not an optimization worth pursuing, but it’s up to you.

You can actually simplify the hash logic even more:

dico = Hash(String, Int32).new default_value: 0

sample_text = <<-TEXT
  something something some thing
  some thing something arduous
  something some some thing
  TEXT

io = IO::Memory.new sample_text

io.each_line do |line|
  a = line.split
  a.each do |word|
    word = word.strip(";,.'?!\n").downcase
    dico[word] += 1
  end
end

a = dico.to_a.sort_by { |key_val| key_val[1] }.reverse
a.each do |item|
  puts item
end

By giving the hash a default value, you make it so that when dico[word] += 1 expands to dico[word] = dico[word] + 1, the dico[word] on the right side is just equal to zero. You could also simplify that logic by leaving the hash initialization alone and using

dico[word] ||= 0 # sets dico[word] to 0 if it's falsey (i.e. nil)
dico[word] += 1

However, I think the default value is cleaner and simpler.

Finally, it turns out that the thing you’re trying to do is already a standard library method:

sample_text = <<-TEXT
  something something some thing
  some thing something arduous
  something some some thing
  TEXT

io = IO::Memory.new sample_text

dico = io.each_line.flat_map { |line| line.split }.tally_by { |word| word.strip(";,.'?!\n").downcase }

a = dico.to_a.sort_by { |key_val| key_val[1] }.reverse
a.each do |item|
  puts item
end

IO#each_line gets us an Iterator of each line, then we #flat_map to iterate over each word, and then we #tally up the counts of the words after they’ve been stripped and downcased. I’m not totally sure, but I don’t think that should result in much more memory allocation than your code, and it relies on the standard library implementation, which reduces your code’s complexity (and removes the possibility of sneaky off-by-one errors!).

Does that all make sense?

3 Likes

Very enlightening. Thank you very much !

1 Like

By a quick look I just noticed that you can save the final reverse when you invert the sort value:
.sort_by { |key_val| key_val[1] }.reverse => .sort_by { |key_val| -key_val[1] }

2 Likes

Thanks !

An interesting variation using a tokenizer:

require "cadmium_tokenizer"

tokenizer = Cadmium::Tokenizer::Aggressive.new
dico = Hash(String, Int32).new default_value: 0

# Read data file
File.each_line("../../../100-0.txt") do |line|
  tokenizer.tokenize(line)
    .reject { |a| a.size == 1 && !a[0].ascii_alphanumeric? }
    .each { |word|
      word = word.downcase
      dico[word] += 1
    }
end

# Dict to array and sort (descending)
entries = dico.to_a
  .sort_by { |key_val| -key_val[1] }

# Display list top 10 results
entries[0, 10].each do |item|
  puts item
end

3 Likes