Database, Array, Hash


#1

Hello,

yesterday I used Crystal for the first time. I wanted to see how it compares to Ruby in a simple but important task. It’s about processing and querying data. As a temporary step, I wanted to read hashes from the database, fix the format (real numbers stored as text), order them by the values, write it back. I also wanted to write back an ordered Array of the keys as the JSON keys aren’t ordered.

The JSON strings (Hashes) have tens of thousand key-value pairs.

Ruby version:

require 'pg'
require 'json'

base = PG.connect('host=localhost dbname=xxx')
rows = base.query('select * from temp order by user_id limit 1000')

puts rows.class
start = Time.now

rows.each do |row|
	results_hash = JSON.parse row['percentages']
	results = results_hash.map { |key, value| [key.to_i, value.to_f] }.sort_by { |x| x[1] }.reverse.to_h
	puts results.inspect[0..20]
end

puts Time.now - start
$ time ./reorder.rb
PG::Result
{362369=>90.36, 74229
{163453=>50.71, 24970
...
{910276=>86.16, 53256
{395316=>84.63, 92819
{932416=>85.71, 78133
57.163174

real	1m1.122s

Ruby version with converting the rows into an Array from the PGResult. Only one line differs:

...
rows = base.query('select * from temp order by user_id limit 1000').to_a
...
time ./a_reorder.rb
Array
{362369=>90.36, 74229
{163453=>50.71, 24970
...
{910276=>86.16, 53256
{395316=>84.63, 92819
{932416=>85.71, 78133
56.1434

real	1m0.557s

Crystal version, using PG::ResultSet:

require "db"
require "pg"

module Crystal
  DB.open "postgresql://localhost/xxx" do |db|
    db.query("select * from temp order by user_id limit 1000") do |rows|
      puts rows.class
      start = Time.now

      rows.each do
        user_id, percentages = rows.read Int32, JSON::Any
        results = percentages.as_h.map { |key, value| [key.to_i, value.to_s.to_f] }.sort_by { |x| x[1] }.reverse.to_h
        puts results.inspect[0..20]
      end

      puts Time.now - start
    end
  end
end
$ time ./crystal
PG::ResultSet
{362369 => 90.36, 742
{163453 => 50.71, 249
...
{910276 => 86.16, 532
{395316 => 84.63, 928
{932416 => 85.71, 264
00:00:29.386695000

real	0m29.469s

Crystal version, using Array:

require "db"
require "pg"

module Crystal
  DB.open "postgresql://localhost/xxx" do |db|
    rows = db.query_all "select * from temp order by user_id limit 1000", as: {Int32, JSON::Any}

    puts rows.class
    start = Time.now

    rows.each do |row|
      results = row[1].as_h.map { |key, value| [key.to_i, value.to_s.to_f] }.sort_by { |x| x[1] }.reverse.to_h
      puts results.inspect[0..20]
    end

    puts Time.now - start
  end
end
$ time ./rows
Array(Tuple(Int32, JSON::Any))
{362369 => 90.36, 742
{163453 => 50.71, 249
...
{910276 => 86.16, 532
{395316 => 84.63, 928
{932416 => 85.71, 264
00:00:21.955791000

real	0m53.327s

Note that I wrote the Ruby version with sort { |x, y| y[1] <=> x[1] } instead of sort_by {}.reverse but the <=> operator didn’t work in Crystal, and I spent already enough time to make the database connection work. Hence I used sort_by {}.reverse in both languages to be fair.

There were details that surprised me. Other details didn’t.

I knew that Ruby had an excellent PostgreSQL driver. Years ago, when Rust was about v1.0, I compared Ruby vs. Rust vs. Go in raw query performance. Ruby was the fastest due to the cached prepared query. Go was the slowest.

It surprised me that converting the rows to an Array in Ruby took almost no time, and it ran for the same time as the PG::Result.each version.

It surprised me even more how slow was the the query-the-rows-as-Array version in Crystal.

You can see that both Ruby versions spent about 4 seconds on querying, fetching, and processing the result from the database. The rest of the time was spent on converting the strings to integers and floats, ordering it, and making it a Hash again.

Crystal was different. The JSON to Hash, convert strings to integers and floats, ordering, and making it a Hash again part took only 22 seconds. However, the query_all was so slow that in the end, that Crystal version was near the same speed as Ruby.

If I consider that the conversion was about 21 seconds, then the query-fetch part using db.query took only 7-8 seconds.

Conclusion:

In this test, the querying and fetching in Ruby was twice as fast (4s vs. 8s) compared to Crystal’s db.query and rows.each. Even a bit faster as the 4s includes the Ruby starting overhead. However, it’s normal. Ruby also has beaten Rust and Go in this area (many years ago when I tested it). The developer of the Rust pg driver wrote to me that the Ruby pg was excellent C code.

I wonder why the db.query_all and rows.each do |row| version in Crystal was about ten times slower than Ruby, as well as about five times slower than the other Crystal version.

As for the typecasts. I checked the Crystal driver’s JSON part. It also calls JSON.parse on the field but in Crystal, there will be an additional typecast, I believe. It’s the JSON::Any to Hash in my code. (The Ruby JSON.parse returns a Hash). Still, these casts were 2.7 times faster (if I counted right) in Crystal. JSON::Any didn’t support #map, unfortunately. In the end, I Crystal was twice as fast as Ruby, if I used the db.query version.

I’m not comparing the performance of the two languages. It was my first Crystal code, only a tiny selection of the features. Besides, most of these Ruby methods are in C.

However, it wasn’t a synthetic test either. I was comparing solutions. I have to process this data for real. The original table has a few billion records. Crystal will help.

Cheers.


#2

Hey, this is great and super interesting! If I read this correctly, Crystal beats Ruby in this in all cases, right? However, you are saying that querying and fetching is slower in Crystal. If that’s the case, I’d like to improve this. The thing I enjoy most in programming is optimizing things.

If you could set up a repo with reproducible code and seed data to try things out in order to optimize it, that would be great. Otherwise I might try to do it myself if I have time.

There’s no reason Crystal should be slower here, we talk directly to Postgres, not even using C bindings.


#3

Also, make sure to read the docs of JSON::Any. They are a wrapper over every json type. To use <=> first cast the value to the type (for example as_i, as_f, etc.).

Also, in your map call you are creating an array each time. Try using a tuple (replace the brackets with curly braces). I’m almost sure that will speed up things a bit more.

Then, why invoke value.to_s.to_f? What is value supposed to be inside the JSON, a string or a float? If it’s a float you can just invoke as_f.

Then, you could use sort_by! instead of sort_by to avoid creating an extra array (in both Ruby and Crystal).


#4

@Maeldron Hi!

I remained curious about this so I set up this locally and tried it out.

TL;DR: Crystal beats Ruby in every possible way, no matter which of the methods you presented I use.

First a question: did you compile the Crystal programs with --release? Because if I don’t then I get times that are slightly worse than Ruby. But if I do pass --release then it’s like 2~3 times faster than Ruby.

Then, using a tuple as I suggested in a previous comment here helps with the times. Also using as_f instead of to_s.to_f.

The implementation of PG in Crystal is pretty straight-forward: send data in the socket, receive data and map it to types. I can’t think that can be slower than adding Ruby’s VM overhead, unless that C code is like super excellent.