Statically Parse CSV to Hash

I feel like this would be a great addition to the language, either as a shard or as part of the CSV class.

Or, maybe it’s already possible? Here is an example:

csv_data = %(name,min_value,max_value,mod_type
,,,
of Skill,5,7,i_atk_speed
of Ease,8,10,i_atk_speed
of Mastery,11,13,i_atk_speed
of Renown,14,17,i_atk_speed
of Clockwork,18,21,i_atk_speed)


ModSuffixWeapons = CSV.parse_to_hash(csv_data, {name: String, min_value: Int32, max_value: Int32, mod_type: String})

# Now the properties could be accessed like so:
puts ModSuffixWeapons["of Skill"]["min_value"]
# The keys are always the first column of each row, or the key column could be set manually
# ["min_value"] is statically typed as an Int32


# And for a struct... I'm not sure how to do that, as you'd need a custom overload for key access ([])

What are your thoughts?

With the help of gitter and @RespiteSage, @watzon, @kai (thank you!), here is what I have so far:

https://play.crystal-lang.org/#/r/74e6

Problem is, we need to call to_i every time we access ["min_value"]. Is there a way to make it permanently an Int32?

Basically, we were thinking of a from_json like method, but for CSV (from_csv). That would be amazeballs.

I think I crafted a possible solution! I am so happy! We convert the values before passing them to a class initializer, which accepts the converted values, so they end up statically typed.

https://play.crystal-lang.org/#/r/74h9

No more using .to_i’s. Thank you @Blacksmoke16 and @oprypin for the help on gitter.
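For readers who cannot open the playground link, here is a minimal sketch of that idea. It is not the linked code: the ModSuffix record and the mods hash are names made up for illustration, and the blank row from the original data is left out to keep it short.

```crystal
require "csv"

# Hypothetical record type (an assumption for this sketch, not from the
# playground link). `record` generates a struct with typed getters and
# an initializer that takes the fields in order.
record ModSuffix, name : String, min_value : Int32, max_value : Int32, mod_type : String

csv_data = %(name,min_value,max_value,mod_type
of Skill,5,7,i_atk_speed
of Ease,8,10,i_atk_speed)

mods = {} of String => ModSuffix

csv = CSV.new(csv_data, headers: true)
csv.each do |c|
  # Convert once, while building the record, so later reads need no to_i.
  mod = ModSuffix.new(c["name"], c["min_value"].to_i, c["max_value"].to_i, c["mod_type"])
  mods[mod.name] = mod
end

puts mods["of Skill"].min_value # typed as Int32
```

The to_i calls have not disappeared; they have just moved to the one place where the rows are loaded.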

Another alternative could be to:

  • convert to int if possible on every column
  • declare explicitly the types expected for each column
  • use the Hash to NamedTuple conversion

This works as long as no column that is expected to hold text contains a purely numeric value. Still, it shows some handy tricks.

Unfortunately, due to its type restrictions and some of the methods it calls, NamedTuple.from does not work directly on a Hash-like API such as a CSV row, so the row has to be converted to a Hash first.

https://play.crystal-lang.org/#/r/74le

require "csv"

csv_data = %(name,min_value,max_value,mod_type
,,,
of Skill,5,7,i_atk_speed
of Ease,8,10,i_atk_speed
of Mastery,11,13,i_atk_speed
of Renown,14,17,i_atk_speed
of Clockwork,18,21,i_atk_speed)

csv = CSV.new csv_data, headers: true
alias TypedRow = {name: String, min_value: Int32, max_value: Int32, mod_type: String}

csv.each do |instance|
  h = instance.row.to_h
  next if h.values.all? &.empty?
  h = h.transform_values { |v| v.to_i32? || v }
  row = TypedRow.from(h)
  pp! row # => {name: "of Skill", min_value: 5, max_value: 7, mod_type: "i_atk_speed"} : TypedRow
end

It would also be possible to create a NamedTuple.from_csv_row that will apply conversion depending on the type of each component.


With that idea, the following can be done: https://play.crystal-lang.org/#/r/74lf

require "csv"


csv_data = %(name,min_value,max_value,mod_type
,,,
of Skill,5,7,i_atk_speed
of Ease,8,10,i_atk_speed
of Mastery,11,13,i_atk_speed
of Renown,14,17,i_atk_speed
of Clockwork,18,21,i_atk_speed)

csv = CSV.new csv_data, headers: true
alias TypedRow = {name: String, min_value: Int32, max_value: Int32, mod_type: String}

csv.each do |instance|
  next if instance.row.to_a.all? &.empty?
  row = TypedRow.from_csv_row(instance.row)
  pp! row # => {name: "of Skill", min_value: 5, max_value: 7, mod_type: "i_atk_speed"} : TypedRow
end

struct NamedTuple
  def self.from_csv_row(row : CSV::Row) : self
    {% begin %}
      NamedTuple.new(**{{T}}).from_csv_row(row)
    {% end %}
  end

  def from_csv_row(row : CSV::Row)
    {% begin %}
      NamedTuple.new(
      {% for key, value in T %}
        {% if value == Int32.class %}
          {{key.stringify}}: row["{{key}}"].to_i32,
        {% else %}
          {{key.stringify}}: row["{{key}}"],
        {% end %}
      {% end %}
      )
    {% end %}
  end
end

@bcardiff Nice!! I played with this a bit yesterday but never finished it, it’s great to see that you got to do many things.

I wonder if it would be a good idea to add some of that to the standard library in the form of CSV::Serializable. It works a bit differently from JSON::Serializable because we need types that map to a full row (like a NamedTuple or a custom type like Person) and then types that map to a single column.

But it could be useful. For example you could map string, int, float, but also dates and enums.

It’s a fun little project to do.

(I also remember we had such a thing in an internal tool at Manas, though it was only halfway done)

EDIT: also, I guess this would only work for parsing a CSV… but that’s already pretty useful.


The problem with putting it in the std-lib is that the format is not standard. CSV does not define how to encode numbers, strings, times, etc. So the solution is always app-specific, or it will be biased toward supporting a JSON/YAML-like set of values for convenience.

I do think there could be a benefit in NamedTuple.from supporting types other than Hash, to avoid creating an intermediate structure.

A spin-off of this, which could also benefit crystal-db and others, would be a nice converter library that can be customized per usage.


It’s true that the format is not standard. However, parsing strings, ints and floats is kind of standard so at least having those supported in the standard library would be nice-to-have. Probably enums too, if you map using their names (or numbers). But anything more complex than that would be left to be decided by a user.

I would expect to handle i18n for CSV parsing. So having configurable conversions seems appealing to me.

I don’t expect that from a structured format like JSON or YAML.
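To make the i18n point concrete, here is a rough sketch of what an app-specific conversion might look like. The DecimalComma module and its from_csv method are hypothetical names, not an existing API; the data assumes a locale that writes decimals with a comma, groups thousands with a dot, and therefore uses ; as the column separator.

```crystal
require "csv"

# Hypothetical converter (an assumption for this sketch): parses numbers
# written as "1.234,50", i.e. dot for thousands and comma for decimals.
module DecimalComma
  def self.from_csv(value : String) : Float64
    # Drop the thousands separators, then turn the decimal comma into a dot.
    value.delete('.').sub(",", ".").to_f
  end
end

csv_data = %(name;price
Widget;1.234,50
Gadget;0,75)

prices = {} of String => Float64

csv = CSV.new(csv_data, headers: true, separator: ';')
csv.each do |c|
  prices[c["name"]] = DecimalComma.from_csv(c["price"])
end

puts prices["Widget"]
```

A plain to_f on these cells would not give the intended numbers, which is why the conversion has to be chosen per application rather than fixed by the parser.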


Since this is possible, I am now 100% convinced JSON.parse could statically type values and, with the power of macros, remove the need for method calls everywhere (to_i, as_s, etc.).

If you can do it with CSV data, there is no reason it cannot be done with JSON data.

In @bcardiff’s first example, there is literally just alias TypedRow = {name: String, min_value: Int32, max_value: Int32, mod_type: String} used. And boom, no need for method invocations. Same should be possible for JSON.parse.

This applies especially to recursive data, so you don’t need to maintain a mapping of the structure. Obviously, Crystal is statically typed, so a mapping is not a bad thing, but we’re talking about the context of JSON.parse. It already makes that process so easy that it should be even better and remove the need for those method calls. This would allow JSON.parse to take advantage of statically typed values.
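For what it’s worth, when the shape of the JSON is known up front, something close to this seems to be available already through from_json on a NamedTuple type rather than through JSON.parse. A minimal sketch, reusing the TypedRow alias from the CSV example (the JSON document itself is made up for illustration):

```crystal
require "json"

alias TypedRow = {name: String, min_value: Int32, max_value: Int32, mod_type: String}

# Sample input, invented for this sketch.
json_data = %({"name": "of Skill", "min_value": 5, "max_value": 7, "mod_type": "i_atk_speed"})

# from_json builds the tuple directly from the parser,
# so each field already has its declared type.
row = TypedRow.from_json(json_data)

puts row[:min_value] + 1 # Int32 arithmetic, no as_i or to_i at the call site
```

This still relies on declaring the types, so it does not cover the case of data whose structure is unknown until runtime.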

@girng note that the to_i calls are still there. In my first example, the conversion could even be applied to text columns that happen to contain only numbers. In the second example, it is applied on demand, based on the expected type, which is more accurate.
