Iterators and JSON

Wouldn’t it make sense to also have to_json and from_json for Iterators? As far as I can see, you currently have to convert them into an array before converting them to JSON.

from_json won’t really work, because initializing an iterator may be more complex than just collecting a bunch of items, as with an array. It can’t be generalized; every iterator implementation would need to define it explicitly.

to_json might be a good idea, although I’ve never encountered an actual use case. Can you present one?

The use case would be streaming a huge dataset from a database to a REST endpoint in JSON format, without needing enough memory to hold the complete dataset.
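
That isn’t stdlib API today, but a minimal to_json for iterators could stream each element straight to a JSON::Builder, so only one element is held in memory at a time (untested sketch):

require "json"

module Iterator(T)
  # Stream the elements as a JSON array without building an Array first.
  # The stdlib's Object#to_json and Object#to_json(io) pick this up.
  def to_json(json : JSON::Builder) : Nil
    json.array do
      each do |item|
        item.to_json(json)
      end
    end
  end
end

squares = (1..5).each.map { |i| i * i }
puts squares.to_json # => [1,4,9,16,25]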

What do you mean? I can imagine it’s similar to deserializing an Array (Array.from_json), except that it’s done lazily and you can request items one by one?

I actually think this is an excellent idea.

Oh, you mean an iterator on a JSON array. I was talking about deserializing a generic iterator from JSON.

Sure, that could technically work. It would basically be the iterator version of the yielding Array.from_json(string_or_io, &).
But the iterator would depend on the pull parser, and parsing can’t continue until the iterator has been fully consumed. The yielding variant ensures that control flow outside only continues after the array has been completely consumed. I’d figure it doesn’t add much benefit over the yielding variant, which should really fit most use cases.
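
For reference, the yielding variant looks like this (this is existing stdlib API; LogEntry is just a made-up example type):

require "json"

record LogEntry, message : String do
  include JSON::Serializable
end

# Parses the array element by element; never builds an Array in memory.
Array(LogEntry).from_json(%([{"message": "a"}, {"message": "b"}])) do |entry|
  puts entry.message
end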

I mean, you could do things like:

Iterator(Model).from_json(json).select { ... }.first(3).to_a

I think that’s worth it, and you can’t do it with Array.from_json.

@wonderix is that what you meant?

That was exactly what I meant.

Yeah, that would work as long as you’re directly consuming a JSON array and don’t care about whether it’s parsed entirely or not.

But consider using Iterator(Model) with JSON::Serializable. It would completely break JSON deserialization: from_json is expected to consume the JSON value entirely.

That’s correct. It would break the semantics of from_json.

Why would the expectation be the same as for anything else? Iterator is lazy, so I wouldn’t expect Iterator(..).from_json to consume it entirely.

I also think this has nothing to do with JSON::Serializable, which maps a type’s properties to JSON.

Ah, I missed that Serializable calls new, not from_json. It only calls from_json on converters. Yeah, I guess it’s fine then.
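
For anyone following along, a converter is an object that responds to from_json(pull) (and to_json(value, builder)); JSON::Serializable invokes it for fields annotated with JSON::Field. A small example using the stdlib’s Time::EpochConverter (Event is made up):

require "json"

struct Event
  include JSON::Serializable

  # JSON::Serializable calls Time::EpochConverter.from_json(pull) for this
  # field instead of Time.new(pull).
  @[JSON::Field(converter: Time::EpochConverter)]
  getter timestamp : Time
end

event = Event.from_json(%({"timestamp": 1700000000}))
puts event.timestamp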

If anyone wants to send a PR for this, here’s some code:

require "json"

record Person, name : String, age : Int32 do
  include JSON::Serializable
end

module Iterator(T)
  def self.from_json(json)
    FromJson(T).new(json)
  end

  class FromJson(T)
    include Iterator(T)

    def initialize(json)
      @pull = JSON::PullParser.new(json)
      @pull.read_begin_array # consume the opening '['
      @end = false
    end

    def next
      if @end
        stop
      elsif @pull.kind.end_array?
        @pull.read_next # consume the closing ']'
        @end = true
        stop
      else
        # Deserialize the next element directly from the pull parser.
        T.new(@pull)
      end
    end
  end
end

json = <<-JSON
  [
    {"name": "Ary", "age": 39},
    {"name": "Anabella", "age": 36},
    {"name": "Luca", "age": 2},
    {"name": "Tahiel", "age": 0}
  ]
JSON

it = Iterator(Person).from_json(json)
pp it.select { |person| person.age > 0 }.first(2).to_a

Of course the PR should include some tests.

I will prepare a PR in the next few days.

There’s something missing: after the last @pull.kind.end_array?, we should make sure no other tokens follow in the JSON source (otherwise it’s invalid JSON). Probably something minor, though.
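
A drop-in replacement for the next method above could cover it (untested sketch; JSON::PullParser#location returns the current line and column):

def next
  if @end
    stop
  elsif @pull.kind.end_array?
    @pull.read_next # consume the closing ']'
    unless @pull.kind.eof?
      # Trailing tokens after the array make the document invalid JSON.
      line, column = @pull.location
      raise JSON::ParseException.new("Expected end of input", line, column)
    end
    @end = true
    stop
  else
    T.new(@pull)
  end
end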

This is good for parsing a large JSON array file, but the use case is very limited. For an API, it’s hard to respond with a huge JSON payload, and it can’t check for invalid JSON.

I don’t think this should go in the stdlib.

Yeah, this seems like it would need a very performance-sensitive use case to justify it.

Reading in the entire file with File.read and using my GeoJSON parsing library* (which uses JSON::Serializable) to parse it, I get the following parsing times, averaged over 10 runs:

Small File (800 B): 00:00:00.000036265
Large File (56 MB): 00:00:00.704692836
Huge File (257 MB): 00:00:02.970463296

Including the file reading (using File.open with a block and passing the file IO to .from_json), here are my times (for a single run of each):

Small File (800 B): 00:00:00.000064845
Large File (56 MB): 00:00:01.656026457
Huge File (257 MB): 00:00:07.308225249

My point is just that the JSON parsing is already very, very fast. It seems like it ought to be fine to parse hundreds of MB of JSON in 7 seconds (or 3, if you’re only considering the parsing). If you need a JSON iterator (and it’s actually faster) then fine, but I doubt it’s a meaningful enough optimization to be in the standard library.
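
(For anyone curious, here’s roughly how such timings can be taken: a sketch using Time.measure, assuming the Person record from the example above and a placeholder file name.)

require "json"

data = File.read("large.json") # placeholder path
runs = 10
total = Time::Span.zero
runs.times do
  total += Time.measure { Array(Person).from_json(data) }
end
puts total / runs # average time per parse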

* I used it because it’s convenient for me; if you need to parse GeoJSON use the relevant GeoCrystal shard

Nice analysis! What’s the memory usage for each of the alternatives?

Ah, right. I guess it does make sense that the iterator would be better for memory optimization…

Here are some values from Benchmark.memory:

Memory for Combined Reading and Parsing (File.open(..., &))
Single Run Small (800 B): 17.1k bytes
Single Run Large (56 MB): 635M bytes
Single Run Huge (257 MB): 2.99G bytes

Memory for Only Parsing (from string)
Single Run Small (800 B): 6.5k bytes
Single Run Large (56 MB): 635M bytes
Single Run Huge (257 MB): 2.99G bytes

Memory for Separate Reading and then Parsing (reading into string, then parsing from string)
Single Run Small (800 B): 9.49k bytes
Single Run Large (56 MB): 693M bytes
Single Run Huge (257 MB): 3.26G bytes

That’s obviously pretty hefty. That said, I feel like my measurements are off somehow… My system monitor never showed the process taking more than 2.5-ish GB.

Yeah, I guess I could go either way on this, then. If the iterator actually reduces the memory overhead significantly, that does seem useful.

Side note: Number#humanize is a wonderful convenience.
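
For reference, a sketch of how such numbers can be collected (again assuming the Person record from above and a placeholder path):

require "benchmark"
require "json"

# Benchmark.memory returns the bytes allocated while the block runs.
allocated = Benchmark.memory do
  Array(Person).from_json(File.read("huge.json")) # placeholder path
end
puts allocated.humanize # e.g. "2.99G"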

The yielding variant of Array.from_json also avoids putting everything in memory.

Array.from_json(string_or_io, &) is really nice. On the other hand, I would have never expected such a method in the Array class because it never produces an Array instance. Additionally, I like the ability of Iterators to process the entries lazily by chaining method calls.