Complex JSON deserialization

I’m reading a JSON document (and then putting it in a sqlite database). The document is much more complicated than I need to actually track, so I’m trying to figure out how to do two things:

First question:

There is a list of associated objects (an array of strings) that is buried deep in nested JSON objects that I otherwise don’t need to extract. For example, imagine the document looks like…

{
  "id": 12345,
  "name": "Paul",
  "complicated": {
    "object": {
      "nested": {
        "deeply": [
          "a",
          "b",
          "c"
        ]
      }
    }
  }
}

And my class looks like:

class Thing
  include JSON::Serializable
  property id : Int64
  property name : String
  property nested_list : Array(String)
  # ...
end

Since the array is nested more than one object deep, is there any way to use JSON::Serializable? Or am I going to have to write a custom from_json function?

Or… is there a way I can have JSON::Serializable trigger on the “complicated” key, and call a custom method to decode it?

Second question:

I would really like my class to have a property, raw_json, which is a String containing the full raw JSON from which it was decoded.

Is this possible with JSON::Serializable in any way?

Using the example above:

class Thing
  include JSON::Serializable
  property id : Int64
  property name : String
  property nested_list : Array(String)
  property raw_json : String
end

json_string = %({"id":12345,"name":"Paul","complicated":{"object":{"nested":{"deeply":["a","b","c"]}}}})
thing = Thing.from_json(json_string)
thing.raw_json == json_string # => true

I know that’s likely to be difficult; I know instead that I can make it nilable, and assign it after creating the object. For example:

class Thing
  def self.new(json_string : String)
    thing = Thing.from_json(json_string)
    thing.raw_json=json_string
    thing
  end
end

But it would be nice if I could do it inside the Thing.from_json class method, since then I would be able to parse it when it’s embedded in other documents.

I hope these questions are clear enough that someone can make some suggestions. The first question is definitely more important than the second…

Thanks!

Paul

I started Crystal a short while ago but have mostly coded in Ruby. In Ruby you can load the pandas bindings with require “Pandas” and then call pd.from_json(“json_file”), which gives you a normalized data frame that you can iterate over.

Thanks for the response.

I think Pandas is a python library, is it not? And the ruby “pandas” module calls the python code? I can’t embed python and the pandas library in my program, and I’m not a python programmer.

I apologize for not seeing what you’re suggesting that I do. Can you explain it a different way, so that I might be able to understand?

Thanks again!

Paul

Yes, pandas is a Python library, but with the bindings you can call it from Ruby as well. That is what the require “Pandas” in the first line does: it loads the pandas bindings, and then you can call those functions. If you want something compiled for Crystal, use polars instead, as there is a Ruby binding for that as well. See here; you can also install the gem directly and use it the same way. GitHub - mrkn/pandas.rb: Pandas wrapper for Ruby

If you just want to put the data into a sqlite database, call jq with backticks from within your Crystal function. It will parse the JSON and make the insert much easier. A single line of jq will do the job and removes all the iteration and backtracking through for loops.

Pretty sure you can use the root option of JSON::Field for that. Check out the API documentation.
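To illustrate the suggestion, here is a minimal sketch of what `root` does, assuming the document from the original post. The `Wrapper` class and its `inner` property are hypothetical names used only for this example. Note that `root` unwraps a single key, so the remaining levels still show up in the property type:

```crystal
require "json"

# Hypothetical class, only for illustrating `root`.
class Wrapper
  include JSON::Serializable

  # `key` selects "complicated"; `root` then unwraps the one key "object".
  # The two remaining levels ("nested" and "deeply") are still visible in the type.
  @[JSON::Field(key: "complicated", root: "object")]
  property inner : Hash(String, Hash(String, Array(String)))
end

json = %({"id":12345,"name":"Paul","complicated":{"object":{"nested":{"deeply":["a","b","c"]}}}})
wrapper = Wrapper.from_json(json)
puts wrapper.inner["nested"]["deeply"] # prints ["a", "b", "c"]
```

So `root` on its own gets one level closer, but it does not reach an array that is buried three objects deep.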

Yes, you can do the latter with a custom converter for this particular nested field.

  @[JSON::Field(key: "complicated", converter: Thing::NestedArrayConverter)]
  property nested_array : Array(String)
  
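  module NestedArrayConverter
    # Each on_key! reads its whole object and returns the block's value,
    # so the parser is left just past "complicated" when this returns.
    def self.from_json(pull : JSON::PullParser) : Array(String)
      pull.on_key!("object") do
        pull.on_key!("nested") do
          pull.on_key!("deeply") do
            Array(String).new(pull)
          end
        end
      end
    end
  end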
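For reference, a self-contained sketch of the converter approach applied to the `Thing` class from the original post. It assumes only `from_json` is needed: the converter defines no `to_json`, so serializing a `Thing` back to JSON would not compile without adding one. The sample document deliberately puts "name" after "complicated" to show that parsing continues correctly past the nested object:

```crystal
require "json"

class Thing
  include JSON::Serializable

  # Walks complicated -> object -> nested -> deeply and returns the array.
  # Only from_json is defined here (assumption: deserialization only).
  module NestedArrayConverter
    def self.from_json(pull : JSON::PullParser) : Array(String)
      pull.on_key!("object") do
        pull.on_key!("nested") do
          pull.on_key!("deeply") do
            Array(String).new(pull)
          end
        end
      end
    end
  end

  property id : Int64
  property name : String

  @[JSON::Field(key: "complicated", converter: Thing::NestedArrayConverter)]
  property nested_list : Array(String)
end

json = %({"id":12345,"complicated":{"object":{"nested":{"deeply":["a","b","c"]}}},"name":"Paul"})
thing = Thing.from_json(json)
puts thing.nested_list # prints ["a", "b", "c"]
puts thing.name        # prints Paul
```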

I suppose it would be nice if the root property of JSON::Field could receive a list of keys instead of a single key. (see `root` property of `JSON::Field` should support nested keys · Issue #13894 · crystal-lang/crystal · GitHub)

For the raw_json property I don’t think there is any way to make Serializable do that for you. The JSON parser operates on a stream and only consumes that stream once. So you’ll have to duplicate the contents outside of the deserialization logic.

Instead of .new you can override .from_json for this:

class Thing
  def self.from_json(string : String)
    super.tap do |thing|
      thing.raw_json = string
    end
  end
end

Note that this won’t cover the signature Thing.from_json(IO).
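A complete sketch of that override might look like the following. It assumes `raw_json` is declared nilable and marked with `@[JSON::Field(ignore: true)]` so that JSON::Serializable neither expects it in the input nor requires it at construction time; that declaration is an addition for this example:

```crystal
require "json"

class Thing
  include JSON::Serializable

  property id : Int64
  property name : String

  # Not read from the JSON input; filled in by from_json below.
  @[JSON::Field(ignore: true)]
  property raw_json : String?

  # Covers only the String signature, not Thing.from_json(IO).
  def self.from_json(string : String)
    super.tap do |thing|
      thing.raw_json = string
    end
  end
end

json_string = %({"id":12345,"name":"Paul","complicated":{"object":{"nested":{"deeply":["a","b","c"]}}}})
thing = Thing.from_json(json_string)
puts thing.raw_json == json_string # prints true
```

A `Thing` that is parsed as part of a larger document goes through `Thing.new(pull)` rather than this method, so its `raw_json` would stay nil.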

For the first question at least, Athena Serializer supports this. It could be an option until the stdlib has support:

class Thing
  include ASR::Serializable

  property id : Int64
  property name : String

  @[ASRA::Accessor(path: {"complicated", "object", "nested", "deeply"})]
  property nested_list : Array(String)
end

ASR.serializer.deserialize Thing, DATA, :json
# => #<Thing:0x7f6f9ae4ee70 @id=12345, @name="Paul", @nested_list=["a", "b", "c"]>

This makes me think that I won’t be the only one who could find my incomplete proposal to the stdlib useful.

With that proposal, the code above would become:

  @[JSON::Field(key: "complicated", converter: Thing::NestedArrayConverter)]
  property nested_array : Array(String)
  
  module NestedArrayConverter
    def self.from_json(pull)
      pull.on_key!("object", "nested", "deeply") do
        return Array(String).new(pull)
      end
    end
  end

A trick I use is to just parse the data with JSON.parse, so it handles all the deserialization and nested objects automatically. Then I use raw.to_xx on the values and put that data in a class or wherever I need it. This way you don’t need to fiddle around and keep annotations or other hard-coded code in sync.

I remember suggesting a way to handle this years ago, where you could specify the keys and their types and let Crystal convert the JSON.parse result (JSON::Any) to their static types.

In fact, JSON.parse is extremely underrated for what it does. There’s no reason we can’t harness what it returns and put static types on those results dynamically. I’m just not sure how, but it would definitely help clear up all the dynamic JSON problems we’ve been running into for years.
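As a rough sketch of that approach applied to the document from the original post: parse everything into a JSON::Any, then pull out typed values with the `as_*` accessors and `JSON::Any#dig`. The `Thing` record here is a stand-in defined for this example, not the class from the question:

```crystal
require "json"

# Stand-in value type for this example.
record Thing, id : Int64, name : String, nested_list : Array(String)

json = %({"id":12345,"name":"Paul","complicated":{"object":{"nested":{"deeply":["a","b","c"]}}}})
any = JSON.parse(json)

thing = Thing.new(
  id: any["id"].as_i64,
  name: any["name"].as_s,
  # dig walks the nested keys; as_a and as_s raise if the shape is not as expected.
  nested_list: any.dig("complicated", "object", "nested", "deeply").as_a.map(&.as_s)
)

puts thing.nested_list # prints ["a", "b", "c"]
```

The trade-off is that the whole document is materialized as JSON::Any values before anything is extracted.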

Isn’t this just what JSON::Serializable is?

Agreed, I implemented a similar concept here in an experimental shard (with a JSON::Any-style struct defined here) and it worked well. There is a minor performance difference between the two though, so I wonder if that would affect decisions for changing the implementation.

require "benchmark"
require "json"

class Vector
  include JSON::Serializable

  getter x : Float64
  getter y : Float64
  getter z : Float64

  def initialize(@x, @y, @z)
  end
end

SOURCE = %({"x": 51.4534, "y": 65.89, "z": 32.112})

c = false
x : Vector
y : Vector

Benchmark.ips do |ips|
  ips.report("control") { c = false }

  ips.report("JSON::Serializable") do
    x = Vector.from_json SOURCE
  end

  ips.report("JSON.parse") do
    json = JSON.parse SOURCE
    y = Vector.new(json["x"].as_f, json["y"].as_f, json["z"].as_f)
  end
end

I ran crystal build --no-debug --release bench.cr then ran the binaries twice (first just for system warmup, second for results).

Windows

           control 653.56M (  1.53ns) (± 6.31%)  0.0B/op          fastest
JSON::Serializable 417.11k (  2.40µs) (± 6.43%)  880B/op  1566.88× slower
        JSON.parse 390.84k (  2.56µs) (± 4.68%)  992B/op  1672.20× slower

WSL (Ubuntu)

           control 638.57M (  1.57ns) (±11.05%)  0.0B/op         fastest
JSON::Serializable 767.70k (  1.30µs) (± 5.57%)  880B/op  831.80× slower
        JSON.parse 742.27k (  1.35µs) (± 4.77%)  992B/op  860.30× slower

That looks like a lot of big numbers, but that’s because of the control variable. For Windows the difference is ~26.27k iterations per second, being ~105.32x slower. For WSL it’s ~25.43k iterations per second, being ~28.5x slower. I don’t know if there are optimisations that could be made to make JSON::Any faster than JSON::Serializable within normal constraints.

That’s not how you interpret the ratios when a control is present, since the denominator is not in terms of the control. On Windows it is (1672.20 - 1) ÷ (1566.88 - 1) = 1.067× slower, on WSL it’s (860.30 - 1) ÷ (831.80 - 1) = 1.034× slower. The control time differs by 3 orders of magnitude so it is not very useful here.
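The arithmetic can be checked with a few lines of Crystal. This sketch reuses the formula from this post on the reported "× slower" columns and, as a cross-check, divides the reported iterations per second directly; the helper name `relative_slowdown` is made up for this example:

```crystal
# Hypothetical helper: compares two "N× slower" figures that were both
# reported relative to the same control job.
def relative_slowdown(faster : Float64, slower : Float64) : Float64
  (slower - 1) / (faster - 1)
end

windows = relative_slowdown(1566.88, 1672.20)
wsl = relative_slowdown(831.80, 860.30)

puts windows.round(3) # prints 1.067
puts wsl.round(3)     # prints 1.034

# Cross-check using the iterations per second (in thousands) directly.
puts (417.11 / 390.84).round(3) # prints 1.067
puts (767.70 / 742.27).round(3) # prints 1.034
```

Both routes agree: JSON.parse is roughly 3 to 7 percent slower than JSON::Serializable in this benchmark.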

Oh right 🤦 Thanks for that. Given that’s such a small difference, I guess the change could be made then? For nested keys JSON::Any#dig? could be taken advantage of.

Also, is there a guide for calculations and using control variables in specs?

Related: Iterations-per-second benchmarks with control jobs / `#before_each` · Issue #13284 · crystal-lang/crystal · GitHub

What change do you refer to?

Changing JSON::Serializable to use JSON::Any. It would provide a solution for the original post and solve a lot of dynamic JSON issues with the module, as pointed out by @girng. Then again, it may be easier to just add on top of the pull parser code to handle these cases.

In a project of mine I remember I did a macro to build Objects from JSON::Any, like JSON::Serializable does with a PullParser. But in the end I was in doubt if this approach was just a result of a bad design of my code used to read the complex JSON or if it was really necessary and a good approach.

I’m not sure that would be a good idea. JSON::Any needs to be read entirely into memory before being usable. The pull parser processes data as a stream, enables custom converters, and can skip over portions of the document that are irrelevant. That’s a very different approach, resulting in many efficiency benefits. This is certainly noticeable when handling non-trivial JSON documents.
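A small sketch of what streaming and skipping look like with the pull parser used directly. The input document and the variable names are made up for this example; the point is that the "big" value is skipped token by token rather than being built into objects:

```crystal
require "json"

# Made-up input: one large, irrelevant value and one field we care about.
io = IO::Memory.new(%({"big": {"a": [1, 2, 3], "b": {"c": null}}, "id": 7}))
pull = JSON::PullParser.new(io)

id = nil
pull.read_object do |key|
  case key
  when "id"
    id = pull.read_int # consume the value we want
  else
    pull.skip # advance past the value without materializing it
  end
end

puts id # prints 7
```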