Any ideas on odd MT behavior with PG?

I have code that talks to a PostgreSQL database using Crystal’s db and pg shards. I compile and run with “-Dpreview_mt”.

It sometimes, but not always, fails with this error:

Unhandled exception in spawn: more than one row (DB::Error)

To me, that reads as: “in your MT code, something that is talking to the DB and expecting a single row back is getting more than one row, and you are not handling the exception.”

Reading through the backtrace for that error, sure enough it points at the db query_methods “query_one:as”, and it is raised from a method of mine that uses exactly that - so it seems pretty obvious: fix my bad SQL.

Except this is the SQL:
select max(some_int_field) as blah from some_table where some_other_int_field=3

Double-checking my PostgreSQL version (the latest) and the docs confirms my memory: the “max” aggregate only ever returns a single row.
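
For context, the call in my code is essentially this (placeholder names; reconstructed from memory):

# Roughly how my method calls it; PG_DB is the global DB::Database handle.
# max() always returns exactly one row (which can hold NULL if nothing matches).
blah = PG_DB.query_one(
  "select max(some_int_field) as blah from some_table where some_other_int_field = $1",
  3,
  as: Int32
)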

So what gives? Is this a bug with Crystal’s pg and how it handles MT?

Can I do something in my code to avoid this? (other than not using MT?)

Is the fix to just not use “query_one” and instead use “query” and break after the first row?

I would love to know what is going on that creates this situation.

Not sure if it’ll solve your problem or not, but if you’re only expecting a single value, maybe switch to #scalar instead?
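
Something roughly like this, reusing your query (untested; assuming the column is an integer):

# scalar returns the first column of the first row; the cast allows for a NULL max().
blah = PG_DB.scalar(
  "select max(some_int_field) from some_table where some_other_int_field = $1",
  3
).as(Int32?)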


I refactored all references to use scalar instead - thank you. If the underlying issue was with the result set(s) and scalar bypasses that, this should do it. So far so good, but as noted it doesn’t always happen, so I will keep an eye on it. In theory it should be a bit faster this way too (or at least I would assume it uses fewer resources). Silly of me not to see that in the docs - thank you so much!

UPDATE: after more testing, it definitely seems to be fine with scalar - so it is something with either result sets or query_one directly when in MT/pools. Thank you again!

scalar and query_one are based on the same query implementation. They’re just convenience wrappers. So the actual behaviour should be identical.
Simplified, they are implemented like this:

def scalar
  query do |rs|
    rs.each do
      return rs.read
    end
  end
  raise NoResultsError.new
end

def query_one
  query do |rs|
    raise NoResultsError.new unless rs.move_next
    value = rs.read
    raise DB::Error.new("more than one row") if rs.move_next
    return value
  end
end

So the only significant difference is that query_one actually checks that there is no second row in the result. scalar on the other hand takes only the first one and doesn’t care if there are more.

So changing from query_one to scalar might actually be just silencing the error.


Fair enough. Still not sure how an aggregate in pg is returning more than one row.

You could try using query and debug log the second (or more?) row.
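
A rough sketch of what I mean, using the names from your query:

# Run the same statement with #query and log every row the driver hands back,
# to see what the unexpected extra rows actually contain.
PG_DB.query("select max(some_int_field) as blah from some_table where some_other_int_field = $1", 3) do |rs|
  row = 0
  rs.each do
    row += 1
    puts "row #{row}: #{rs.read.inspect}"
  end
end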

Maybe something is going wrong with multi-threading. Although I would assume db and pg are thread-safe (not sure about that, though). And a thread-safety issue would likely not only yield unexpected rows but also break in other ways.

Are you sharing/keeping the connections through threads somehow? Or are you running the query through the pool directly?

The latter, I believe.
But several years ago, when I was last trying something similar to this, I had a different issue, and both you and straight-shoota weighed in at that time as well. I was just reviewing that post and see that I could potentially try some of those things on this latest one to see if it changes any of this behavior.

Unfortunately I won’t have time to look at it again for a few days, but I will revisit it then to see if I can either debug as straight-shoota suggests and/or improve how I (ab)use the pooling.

Okay, managed to… I think break it entirely? Can’t get DB errors if it won’t run, nailed it.

In terms of how the DB is used, very early in the startup it does:

PG_DB = DB.open(DB_CONNECTION_STRING)

(there is also some code that captures ctrl-c and closes the DB if observed)
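
That part is more or less the usual trap - something like:

# Close the shared pool on ctrl-c before exiting.
Signal::INT.trap do
  PG_DB.close
  exit
end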

Then it just slammed that with:

channel = Channel(Nil).new
@arr_objs.each do |obj|
    spawn do 
        obj.init(p1, p2) 
        channel.send(nil)   
    end
end
@arr_objs.size.times { channel.receive }

@arr_objs is an array of 120 structs.

In obj.init there are some methods that get called that hit the DB. Each query takes less than a second, but there might be a bunch of them (I haven’t actually measured).

First thing I did was add to the DB_CONNECTION_STRING:
?retry_attempts=30&checkout_timeout=60

That made a huge difference and I stopped getting the previous issue; now the errors were either scalar getting “no results”, a lost connection, a closed stream, or an “end of file” error - all out of the DB (or DB connection).

Then I tried adding to the connection string:
prepared_statements=false
That broke everything. I didn’t bother looking too hard into why, since I suspect that even if it worked I would have perf issues, so I would rather keep prepared statements working if possible.

So I went back to my post here from a few years ago and added the two-channel suggestion to my code: one channel to manage some workers that control the DB consumption, and one to manage the work being done by them.

This now varies between working perfectly, hanging forever on the last item with no CPU load, or throwing a “no results” DB error in a new spot that never threw errors before.

(When it was working, I thought maybe it was all set, so I upped my CRYSTAL_WORKERS and it immediately started failing in the original way again.)

The “no results” is fascinating, because in every case of it there is literally no way the DB itself has no results: I have it log the SQL it is using, and I can take that, run it in DataGrip, and get data back with no issue. So it seems to be something about dropping connections from overwhelming the pool, and the goofiness that happens amidst that. And like I said, sometimes it works for many rounds and then fails. There is no randomization purposely happening in the code (versus this coming up randomly because of system load or something).

The last time this happened I just gave up and stopped using MT when using the DB.

But this time around, when it does actually work, it is so much faster that it is just too much of a kick in the gut to have to be single-threaded for the DB parts. It literally cuts days off the run time. (Like the last time I posted, this is Monte Carlo code that runs a bunch of simulations; the DB part just initializes the objects that will be in the actual simulation. The DB doesn’t get hit in the sim stuff, and that all does MT perfectly fine - it is the DB part that is failing in what seem like nondeterministic ways.)

I am going to see what else I can debug out of it to try and figure out what is happening.

I am still using this at the root, and when classes call the DB, they use it:

DB_CONNECTION_STRING = "postgres://#{ENV["DB_USERNAME"]}:#{ENV["DB_PASSWORD"]}@#{ENV["DB_HOST"]}:#{ENV["DB_PORT"]}/#{ENV["DB_NAME"]}?max_pool_size=50&retry_attempts=30&checkout_timeout=60"      
PG_DB = DB.open(DB_CONNECTION_STRING)

I have modified the spawn code from this:

channel = Channel(Nil).new
@arr_objs.each do |obj|
    spawn do 
        obj.init(p1, p2) 
        channel.send(nil)   
    end
end
@arr_objs.size.times { channel.receive }

To this:

workers_count = 5
jobs = Channel(SomeObj).new(workers_count)
channel = Channel(UInt8).new(@arr_objs.size)
workers_count.times do
    spawn do
        while (obj = jobs.receive)
            obj.init(@arr_details, @str_date)
            channel.send(1_u8)
        end
    end
end

@arr_objs.each do |obj|
    jobs.send(obj)
end       

counter = 0
@arr_objs.size.times do
    puts "X: #{counter += 1}/#{@arr_objs.size}"
    channel.receive
    puts "Y: #{counter}/#{@arr_objs.size}"
end 

Then I ran show max_connections on my DB; it is 100. So I added max_pool_size=50 to the DB connection string.

Then, while it is in the part that uses MT and hits the DB, I monitor select * from pg_stat_activity, and the most connections I ever see in there are my DataGrip connection, the first connection from the DB.open, and then 5-8 others from the MT code.

The X/Y code will frequently just stop 1-3 short of the end and hang forever with no errors on the console - but the CPU load is nothing (from these, and minor from other things running on the computer) and the DB shows only the first connection active (from the global DB.open).

It seems to have settled down into a pattern of four things that might happen for each iteration of arr_objs and/or each run of the program (the contents of arr_objs are always the same every time I run this, and in the same order - yet where it fails seems to be entirely random each run; it does not always fail in the same spot):

  1. it works fine
  2. no exceptions are thrown, but it hangs somewhere around the last 1-3 (e.g. if arr_objs.size is 120, it will hang on 119, if it chooses to hang) - no activity in the DB or CPU, and it will sit there forever until killed; it shows the X debug line on the console but never gets to the Y (it also sometimes does this 3 short… not sure if relevant, but I haven’t yet seen it do 2 short - I suspect that’s just random rather than meaningful)
  3. it throws an exception saying no results were returned from an aggregate (sort of the opposite of what it was doing before some of these adjustments, when it claimed multiple results were returned from an aggregate, which just isn’t possible given the SQL) - if I take the SQL it logged and run it, it returns data as you would expect, so it seems to be executing it oddly rather than there being a bug in the SQL itself
  4. it throws dual exceptions concurrently (see below):

(these have been truncated to remove the parts of my functions getting called as I don’t think they are relevant)

Unhandled exception in spawn: Expected PQ::Frame::ParseComplete but got PQ::Frame::CommandComplete(@bytes=Bytes[83, 69, 76, 69, 67, 84, 32, 49, 0]) (Exception)
  from /Users/user/.cache/crystal/crystal-run-pga.tmp in 'Exception::CallStack::unwind:Array(Pointer(Void))'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'Exception::CallStack#initialize:Array(Pointer(Void))'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'Exception::CallStack::new:Exception::CallStack'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'raise<Exception>:NoReturn'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'raise<String>:NoReturn'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'PQ::Connection#expect_frame<PQ::Frame::ParseComplete.class, Nil>:PQ::Frame::ParseComplete'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'PQ::Connection#expect_frame<PQ::Frame::ParseComplete.class>:PQ::Frame::ParseComplete'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'PG::Statement#perform_query<Tuple(Int32)>:PG::ResultSet'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'DB::Statement+@DB::Statement#perform_query_with_rescue<Tuple(Int32)>:PG::ResultSet'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'DB::Statement+@DB::Statement#query:args<Int32, Nil>:PG::ResultSet'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'DB::PoolStatement+@DB::PoolStatement#query:args<Int32, Nil>:PG::ResultSet'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'DB::Database@DB::QueryMethods(Stmt)#query:args<String, Int32, Nil>:PG::ResultSet'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 
...
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in 'Fiber#run:(IO::FileDescriptor | Nil)'
  from /Users/user/.cache/crystal/crystal-run-somecode.tmp in '~proc3Proc(Fiber, (IO::FileDescriptor | Nil))@/opt/homebrew/Cellar/crystal/1.13.3_1/share/crystal/src/fiber.cr:95'

and

Unhandled exception in spawn: Expected PQ::Frame::CommandComplete but got PQ::Frame::ReadyForQuery(@transaction_status=PQ::Frame::ReadyForQuery::Status::Idle) (Exception)
  from lib/pg/src/pq/connection.cr:447:7 in 'expect_frame'
  from lib/pg/src/pq/connection.cr:430:9 in 'read_next_row_start'
  from lib/pg/src/pg/result_set.cr:49:8 in 'move_next'
  from lib/pg/src/pg/result_set.cr:208:20 in 'do_close'
  from lib/db/src/db/disposable.cr:11:7 in 'close'
  from lib/db/src/db/statement.cr:19:23 in 'scalar:args'
  from lib/db/src/db/pool_statement.cr:34:30 in 'scalar:args'
  from lib/db/src/db/query_methods.cr:284:7 in 'scalar'
...

Given the max_connections is 100 and in practice I appear to be maxing out around 8, I don’t seem to be overwhelming the connections - or any resources on the OS or DB for that matter.

So am I “Just Doing It Wrong” or is there something larger at play out of my control, like a bug in the db pg code around mt?

I am being very lazy and not doing anything about the exceptions, just letting it fall flat - should I handle them more gracefully and retry, or do anything more sophisticated than that?
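
If I do end up handling them, I assume the worker fiber would grow something like this (completely untested sketch; the rescued error types are guesses on my part):

# Untested sketch: rescue per job so one failed query doesn't kill the worker fiber,
# and always send the completion signal so the receiving loop can't hang short.
spawn do
  while (obj = jobs.receive)
    begin
      obj.init(@arr_details, @str_date)
    rescue ex : DB::Error | IO::Error
      puts "init failed: #{ex.message}"
    ensure
      channel.send(1_u8)
    end
  end
end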

Any suggestions on things I should try/test? I would really love to not have to run the DB parts serially - particularly when this will eventually run on a machine with many cores in “production” (this is just personal code that makes me money, but it is nothing commercial).


The bulk of my database calls were already in a dedicated class - so I just made that class initialize and finalize with this:

@db : DB::Database

def initialize
    @db = DB.open(DB_CONNECTION_STRING)
end

def finalize
    @db.close
end

Then swapped out, in that class, all db calls from PG_DB to @db.

That way, presumably each obj off of arr_objs spawns its own version of this, gets its own connection pool, and then closes it out when it is done.

Watching the connections and load on the DB, it started out reasonable, but over a few iterations of creating new arr_objs it got much higher than before. So I lowered the max_pool_size in the connection string from 50 to 10 and also lowered the checkout_timeout from 60 to 10 (otherwise it eventually hits the max_connections of the DB). I still need to do more restarts to confirm it isn’t a fluke, but each class having its own pool, versus all of them navigating the larger shared one, seems to have resolved whatever was going on.

(I may be misunderstanding what it is actually doing, or maybe was before - this sort of thing always makes me question if I understand anything at all - but regardless, it seems to have resolved my issue described here, so leaving this in case future searchers run into something like it - or myself in a few years at this rate.)

Actually that appears to eventually blow up the connections.

I upped the DB max_connections from 100 to 200, but that also would get consumed (I haven’t tested it, but I highly suspect that it will consume any amount).

I adjusted max_pool_size, initial_pool_size, and max_idle_pool_size, setting them all to 2.
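
So the pool part of the connection string now looks something like this (keeping the earlier retry/timeout values):

# All three pool-size knobs pinned to 2 for each DB.open.
DB_CONNECTION_STRING = "postgres://#{ENV["DB_USERNAME"]}:#{ENV["DB_PASSWORD"]}@#{ENV["DB_HOST"]}:#{ENV["DB_PORT"]}/#{ENV["DB_NAME"]}?initial_pool_size=2&max_pool_size=2&max_idle_pool_size=2&retry_attempts=30&checkout_timeout=10"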

With 5 workers (as in the workers_count and channel related to that, not the CRYSTAL_WORKERS env), the DB connections would stay as expected for a bit, down around 5-10 connections - but at some point something happened and it would blow up to consume all of the DB connections until it fell over saying there were no more.

(using this to observe connections select count(*) as my_connections from pg_stat_activity where application_name = 'crystal' - I was only running a single thing that would show up in there as crystal)

I was going to try using “1” on those, but at that point that is a single connection, right?

So with 5 at a time, each initializing 2 connections, I would expect around 10 connections during this - which is basically what happens, and then at some point it just explodes and consumes all of them.
Looking at htop during this time, a bunch of the processes are showing as idle vs actually running queries.

Since the given class of mine that does the database work only needs a single connection, I adjusted the initialize and declaration to:

@db : DB::Connection
def initialize
    @db = DB.connect(DB_CONNECTION_STRING)
end

Even with the individual connection, it still gets some spots where it grows more than expected, but it is at least manageable (e.g. like 20-40ish connections).

I am not sure what is going on with the connection pool, but there seems to be a way to get it to blow up and continue to consume connections even if not using them.