I did some testing on my machine, and parsing certainly seems slower than I remember it being when I benchmarked GeoJSON parsing a while back. I could be misremembering, though, and I can’t find the benchmarks I made (on pretty large files) at the moment.
Here’s my code:
require "json"
require "random"
require "file"
require "time"
struct Inner
include JSON::Serializable
property inner_name : String
property numbers : Array(Int32)
def initialize(
@inner_name : String,
@numbers : Array(Int32) = Array(Int32).new
)
end
end
struct Middle
include JSON::Serializable
property middle_name : String
property inner_values : Array(Inner)
def initialize(
@middle_name : String,
@inner_values : Array(Inner) = Array(Inner).new
)
end
end
struct Outer
include JSON::Serializable
property outer_name : String
property middle_values : Array(Middle)
def initialize(
@outer_name : String,
@middle_values : Array(Middle) = Array(Middle).new
)
end
end
def create_structure(scale_factor : Int32, rng : Random, numbers_range : Range(Int32, Int32)) : Array(Outer)
Array(Outer).new(scale_factor) {
Outer.new(
rng.base64,
Array(Middle).new(scale_factor) {
Middle.new(
rng.base64,
Array(Inner).new(scale_factor) {
Inner.new(
rng.base64,
Array(Int32).new(scale_factor) { rng.rand(numbers_range) }
)
}
)
}
)
}
end
def count(structure : Array(Outer))
structure.sum { |outer|
outer.middle_values.sum { |middle|
middle.inner_values.sum { |inner|
0_u64 + inner.numbers.size
}
}
}
end
def puts_seconds_elapsed(label, start_time, end_time)
puts "#{label}: #{(end_time - start_time).total_seconds}s"
end
scale_factor = 100
if ARGV.size > 0 && (first_arg_int = ARGV.first.to_i?)
scale_factor = first_arg_int
end
filename = "big_json.json"
rng = Random.new(seed: scale_factor)
numbers_range = (1000..9999)
do_write = true
if do_write
structure = create_structure scale_factor, rng, numbers_range
begin
start_time = Time.monotonic
structure_json = structure.to_json
end_time = Time.monotonic
puts_seconds_elapsed "serialization to string", start_time, end_time
file = File.open filename, "w"
start_time = Time.monotonic
file << structure_json
end_time = Time.monotonic
file.close
puts_seconds_elapsed "file write from string", start_time, end_time
end
file = File.open filename, "w"
start_time = Time.monotonic
structure.to_json file
end_time = Time.monotonic
file.close
puts_seconds_elapsed "file write with serialization", start_time, end_time
end
begin
start_time = Time.monotonic
file_contents = File.read filename
end_time = Time.monotonic
puts_seconds_elapsed "file read to string", start_time, end_time
start_time = Time.monotonic
structure_from_file = Array(Outer).from_json file_contents
end_time = Time.monotonic
puts_seconds_elapsed "parsing from string", start_time, end_time
# just to make sure the compiler doesn't elide anything
File.write File::NULL, count(structure_from_file)
end
file = File.open(filename, "r")
start_time = Time.monotonic
structure_from_file = Array(Outer).from_json file
end_time = Time.monotonic
file.close
puts_seconds_elapsed "parsing from file", start_time, end_time
# just to make sure the compiler doesn't elide anything
File.write File::NULL, count(structure_from_file)
**Notes on the Code**

- I tried out different `File` buffering settings, but they didn’t seem to make any difference, even in the “write with serialization” and “parsing from file” cases (see the first sketch after this list).
- The `begin...end` blocks are an attempt to create variable scopes to help manage memory usage, but I don’t know whether that actually works (see the second sketch after this list).
- I tried to make the serializable structures as simple as possible (to make them easier to review) while still exhibiting nesting, since real-world JSON tends to be heavily nested.
- I made basically no attempt to optimize `create_structure` or `count`, because they’re not what I was trying to benchmark.
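To be concrete about the buffering experiments: this is a sketch of the kind of variation I mean, not the exact code I ran. It uses the `sync=`, `read_buffering=`, and `buffer_size=` setters that `File` picks up from `IO::Buffered`; the 1 MiB buffer size is just an arbitrary example value.

```crystal
# Sketch: varying File buffering (File includes IO::Buffered).
file = File.open filename, "w"
file.sync = false          # keep writes buffered (the default)
file.buffer_size = 1 << 20 # arbitrary example: 1 MiB instead of the default
structure.to_json file
file.close

file = File.open filename, "r"
file.read_buffering = true # keep reads buffered (the default)
file.buffer_size = 1 << 20
structure_from_file = Array(Outer).from_json file
file.close
```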
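On the scoping point, a heavier-handed alternative (again just a sketch, and not what produced the numbers below) would be to drop the reference explicitly and then call `GC.collect`, rather than hoping the `begin...end` scope is enough:

```crystal
# Sketch: explicitly releasing the big input string between phases
# instead of relying on begin...end scoping.
start_time = Time.monotonic
structure_from_file = Array(Outer).from_json file_contents
end_time = Time.monotonic
puts_seconds_elapsed "parsing from string", start_time, end_time

file_contents = "" # drop the only reference to the big string
GC.collect         # ask the GC to reclaim it before the next benchmark
```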
Example produced JSON, with scale factor 2, after formatting with `jq`:
```json
[
  {
    "outer_name": "qetdD4TLe9Ijt+J9Z+dlYg==",
    "middle_values": [
      {
        "middle_name": "T81VaTBy7EJ+2r4G2fATSA==",
        "inner_values": [
          {
            "inner_name": "epJ0N7wZSPdM/UZJzuTAvA==",
            "numbers": [
              9016,
              9814
            ]
          },
          {
            "inner_name": "gC1J8zsb6sXhl9i6A67Apw==",
            "numbers": [
              2739,
              3830
            ]
          }
        ]
      },
      {
        "middle_name": "0I9taSfFVEFJNkUbOPnJxA==",
        "inner_values": [
          {
            "inner_name": "bBlzJ6IPbI53SC+4LLIjAg==",
            "numbers": [
              1986,
              5623
            ]
          },
          {
            "inner_name": "x9Z5bWIal4qRClJfeMw2fg==",
            "numbers": [
              8853,
              7967
            ]
          }
        ]
      }
    ]
  },
  {
    "outer_name": "siItIL1Wb72iq3N/bqYoYQ==",
    "middle_values": [
      {
        "middle_name": "K/j4cgIgOXpV1juImq15uQ==",
        "inner_values": [
          {
            "inner_name": "os64AVLIAuYGuhKhBaxZDw==",
            "numbers": [
              9189,
              1888
            ]
          },
          {
            "inner_name": "CguKhvwLFKCG8WkAtlTUWA==",
            "numbers": [
              9455,
              9214
            ]
          }
        ]
      },
      {
        "middle_name": "uIYmymvfO2Y2k8wQXjCB6Q==",
        "inner_values": [
          {
            "inner_name": "RUZapun49A2gzOHArkubNA==",
            "numbers": [
              6706,
              3441
            ]
          },
          {
            "inner_name": "Fc+DBmjHNtxcevNweLKyQQ==",
            "numbers": [
              5703,
              9299
            ]
          }
        ]
      }
    ]
  }
]
```
And here’s the output I’m getting on my machine:
**Scale Factor 10 (112 KB file)**

```
serialization to string: 0.003119035s
file write from string: 0.0001233s
file write with serialization: 0.001344965s
file read to string: 0.000161006s
parsing from string: 0.002256683s
parsing from file: 0.005329564s
```

**Scale Factor 50 (37 MB file)**

```
serialization to string: 0.407213263s
file write from string: 0.036222296s
file write with serialization: 0.428914678s
file read to string: 0.029467402s
parsing from string: 0.910451311s
parsing from file: 1.886663833s
```

**Scale Factor 100 (529 MB file)**

```
serialization to string: 6.845525082s
file write from string: 0.533993881s
file write with serialization: 7.428256482s
file read to string: 0.199061933s
parsing from string: 15.233927115s
parsing from file: 29.497250194s
```