Iterators and JSON

Wouldn’t it make sense to also have to_json and from_json for Iterators? As far as I can see, you currently have to convert them into an array before converting them to JSON.

from_json won’t really work, because initializing an iterator may be more complex than just collecting a bunch of items, as with an array. It can’t be generalized; every iterator implementation would need to define it explicitly.

to_json might be a good idea, although I’ve never encountered an actual use case. Can you present one?

The use case would be streaming a huge dataset from a database to a REST endpoint in JSON format, without needing enough memory to hold the complete dataset.
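
That isn’t stdlib API today, but a minimal to_json for iterators could stream each element straight to a JSON::Builder, so only one element is held in memory at a time (untested sketch):

require "json"

module Iterator(T)
  # Stream the elements as a JSON array without building an Array first.
  # The stdlib's Object#to_json and Object#to_json(io) pick this up.
  def to_json(json : JSON::Builder) : Nil
    json.array do
      each do |item|
        item.to_json(json)
      end
    end
  end
end

squares = (1..5).each.map { |i| i * i }
puts squares.to_json # => [1,4,9,16,25]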

What do you mean? I can imagine it’s similar to deserializing an Array (Array.from_json), except that it’s done lazily and you can request items one by one?

I actually think this is an excellent idea.

Oh, you mean an iterator on a JSON array. I was talking about deserializing a generic iterator from JSON.

Sure, that could technically work. It would basically be the iterator version of the yielding Array.from_json(string_or_io, &).
But the iterator would depend on the pull parser, and parsing can’t continue until the iterator has been fully consumed. The yielding variant ensures that control flow outside only continues after the array has been completely consumed. I’d figure it doesn’t add much benefit over the yielding variant, which should really fit most use cases.
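
For reference, the yielding variant looks like this (this is existing stdlib API; LogEntry is just a made-up example type):

require "json"

record LogEntry, message : String do
  include JSON::Serializable
end

# Parses the array element by element; never builds an Array in memory.
Array(LogEntry).from_json(%([{"message": "a"}, {"message": "b"}])) do |entry|
  puts entry.message
end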

I mean, you could do things like:

Iterator(Model).from_json(json).select { ... }.first(3).to_a

I think that’s worth it, and you can’t do it with Array.from_json.

@wonderix is that what you meant?

That was exactly what I meant.

Yeah, that would work as long as you’re directly consuming a JSON array and don’t care about whether it’s parsed entirely or not.

But consider using Iterator(Model) with JSON::Serializable. It would completely break JSON deserialization: from_json is expected to consume the JSON value entirely.

That’s correct. It would break the semantics of from_json.

Why would the expectation be the same as for anything else? Iterator is lazy, so I wouldn’t expect Iterator(..).from_json to consume it entirely.

I also think this has nothing to do with JSON::Serializable, which maps a type’s properties to JSON.

Ah, I missed that Serializable calls new, not from_json. It only calls from_json on converters. Yeah, I guess it’s fine then.
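
For anyone following along, a converter is an object that responds to from_json(pull) (and to_json(value, builder)); JSON::Serializable invokes it for fields annotated with JSON::Field. A small example using the stdlib’s Time::EpochConverter (Event is made up):

require "json"

struct Event
  include JSON::Serializable

  # JSON::Serializable calls Time::EpochConverter.from_json(pull) for this
  # field instead of Time.new(pull).
  @[JSON::Field(converter: Time::EpochConverter)]
  getter timestamp : Time
end

event = Event.from_json(%({"timestamp": 1700000000}))
puts event.timestamp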

If anyone wants to send a PR for this, here’s some code:

require "json"

record Person, name : String, age : Int32 do
  include JSON::Serializable
end

module Iterator(T)
  def self.from_json(json)
    FromJson(T).new(json)
  end

  class FromJson(T)
    include Iterator(T)

    def initialize(json)
      @pull = JSON::PullParser.new(json)
      @pull.read_begin_array # consume the opening '['
      @end = false
    end

    def next
      if @end
        stop
      elsif @pull.kind.end_array?
        @pull.read_next # consume the closing ']'
        @end = true
        stop
      else
        # Deserialize the next element directly from the pull parser.
        T.new(@pull)
      end
    end
  end
end

json = <<-JSON
  [
    {"name": "Ary", "age": 39},
    {"name": "Anabella", "age": 36},
    {"name": "Luca", "age": 2},
    {"name": "Tahiel", "age": 0}
  ]
JSON

it = Iterator(Person).from_json(json)
pp it.select { |person| person.age > 0 }.first(2).to_a

Of course the PR should include some tests.

I will prepare a PR in the next few days.

There’s something missing: after the last @pull.kind.end_array?, we should make sure no other tokens follow in the JSON source (otherwise it’s invalid JSON). Probably something minor, though.
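
A drop-in replacement for the next method above could cover it (untested sketch; JSON::PullParser#location returns the current line and column):

def next
  if @end
    stop
  elsif @pull.kind.end_array?
    @pull.read_next # consume the closing ']'
    unless @pull.kind.eof?
      # Trailing tokens after the array make the document invalid JSON.
      line, column = @pull.location
      raise JSON::ParseException.new("Expected end of input", line, column)
    end
    @end = true
    stop
  else
    T.new(@pull)
  end
end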

This is good for parsing a large JSON array file, but the use case is very limited. For an API, it’s hard to respond with a huge JSON payload, and it can’t check for invalid JSON.

I don’t think this should go in the stdlib.

Yeah, this seems like it would need a very performance-sensitive use case to justify it.

Reading in the entire file with File.read and using my GeoJSON parsing library* (which uses JSON::Serializable) to parse it, I get the following parsing times, averaged over 10 runs:

Small File (800 B): 00:00:00.000036265
Large File (56 MB): 00:00:00.704692836
Huge File (257 MB): 00:00:02.970463296

Including the file reading (using File.open with a block and passing the file IO to .from_json), here are my times (for a single run of each):

Small File (800 B): 00:00:00.000064845
Large File (56 MB): 00:00:01.656026457
Huge File (257 MB): 00:00:07.308225249

My point is just that the JSON parsing is already very, very fast. It seems like it ought to be fine to parse hundreds of MB of JSON in 7 seconds (or 3, if you’re only considering the parsing). If you need a JSON iterator (and it’s actually faster) then fine, but I doubt it’s a meaningful enough optimization to be in the standard library.
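
(For anyone curious, here’s roughly how such timings can be taken: a sketch using Time.measure, assuming the Person record from the example above and a placeholder file name.)

require "json"

data = File.read("large.json") # placeholder path
runs = 10
total = Time::Span.zero
runs.times do
  total += Time.measure { Array(Person).from_json(data) }
end
puts total / runs # average time per parse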

* I used it because it’s convenient for me; if you need to parse GeoJSON use the relevant GeoCrystal shard

Nice analysis! What’s the memory usage for each of the alternatives?

Ah, right. I guess it does make sense that the iterator would be better for memory optimization…

Here are some values from Benchmark.memory:

Memory for Combined Reading and Parsing (File.open(..., &))
Single Run Small (800 B): 17.1k bytes
Single Run Large (56 MB): 635M bytes
Single Run Huge (257 MB): 2.99G bytes

Memory for Only Parsing (from string)
Single Run Small (800 B): 6.5k bytes
Single Run Large (56 MB): 635M bytes
Single Run Huge (257 MB): 2.99G bytes

Memory for Separate Reading and then Parsing (reading into string, then parsing from string)
Single Run Small (800 B): 9.49k bytes
Single Run Large (56 MB): 693M bytes
Single Run Huge (257 MB): 3.26G bytes

That’s obviously pretty hefty. That said, I feel like my measurements are off somehow… My system monitor never showed the process taking more than 2.5-ish GB.

Yeah, I guess I could go either way on this, then. If the iterator actually reduces the memory overhead significantly, that does seem useful.

Side note: Number#humanize is a wonderful convenience.
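
For reference, a sketch of how such numbers can be collected (again assuming the Person record from above and a placeholder path):

require "benchmark"
require "json"

# Benchmark.memory returns the bytes allocated while the block runs.
allocated = Benchmark.memory do
  Array(Person).from_json(File.read("huge.json")) # placeholder path
end
puts allocated.humanize # e.g. "2.99G"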

The yielding variant of Array.from_json also avoids putting everything in memory.

Array.from_json(string_or_io, &) is really nice. On the other hand, I would have never expected such a method in the Array class because it never produces an Array instance. Additionally, I like the ability of Iterators to process the entries lazily by chaining method calls.