Any ideas on odd MT behavior with PG?

Okay, managed to… I think break it entirely? Can’t get DB errors if it won’t run, nailed it.

In terms of how the DB is used, very early in the startup it does:

PG_DB = DB.open(DB_CONNECTION_STRING)

(there is also some code that captures ctrl-c and closes the DB if observed)

Then it just slammed that with:

# one nil is sent per finished object
channel = Channel(Nil).new
@arr_objs.each do |obj|
    spawn do
        obj.init(p1, p2)
        channel.send(nil)
    end
end
# wait until every fiber has reported in
@arr_objs.size.times { channel.receive }

@arr_objs is an array of 120 structs.

In obj.init there are some methods that get called that hit the DB. Each query takes less than a second, but there might be a bunch of them (I actually haven’t measured).

First thing I did was add to the DB_CONNECTION_STRING:
?retry_attempts=30&checkout_timeout=60

That made a huge difference: the previous issue went away. The errors I got instead were one of: the scalar getting “no results”, a lost connection, a closed stream, or an “end of file” error, all coming out of the DB (or DB connection).
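
For context, those two settings sit alongside the other pool options that crystal-db reads from the connection string (max_pool_size, max_idle_pool_size and so on). Below is a rough sketch of building such a string so that the pool size lines up with a fixed number of workers. The host, credentials, database name and the value of WORKERS are all made up for illustration, and I have not confirmed that sizing the pool this way changes anything.

```crystal
# Hypothetical worker count; not a value from my real code.
WORKERS = 8

# Pool options understood by crystal-db, keyed by their query parameter names.
pool_params = {
  "max_pool_size"      => WORKERS, # cap on open connections
  "max_idle_pool_size" => WORKERS, # keep idle connections around between bursts
  "checkout_timeout"   => 60,      # seconds to wait for a free connection
  "retry_attempts"     => 30,      # retries after a lost connection
}

query = pool_params.map { |key, value| "#{key}=#{value}" }.join("&")

# Placeholder host, credentials and database name.
connection_string = "postgres://user:pass@localhost:5432/sims?#{query}"
puts connection_string
```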

Then I tried adding to the connection string:
prepared_statements=false
That broke everything. I didn’t look too hard into why, since I suspected that even if it worked I would have perf issues, so I would rather get it working with prepared statements on, if possible.

So I went back to the last time I posted here a few years back and added the two channels suggestion to my code - one to manage some workers to control the DB consumption and one to manage the work being done by them.
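
Roughly, the two-channel shape looks like the sketch below. This is a simplified stand-in, not my actual code: Obj, init_all, the worker count and the simulated failure on id 13 are all made up for illustration. One thing the sketch does on purpose is rescue inside the worker and send the exception back over the results channel. If a fiber dies on an unhandled exception it never sends, and the collector waits forever, which would at least look like a hang with no CPU load. Whether that is what is happening in my case is an assumption I still need to check.

```crystal
# Hypothetical stand-in for the real struct; `init` is where the DB queries would be.
struct Obj
  getter id : Int32

  def initialize(@id : Int32)
  end

  def init : Nil
    raise "no results" if @id == 13 # simulated DB failure for the demo
    sleep 1.millisecond             # placeholder for the real queries
  end
end

# Runs `init` on every object with at most `workers` fibers in flight.
# Returns the exceptions that were raised instead of letting a fiber die silently.
def init_all(objs : Array(Obj), workers : Int32) : Array(Exception)
  jobs = Channel(Obj).new
  done = Channel(Exception?).new

  workers.times do
    spawn do
      # receive? returns nil once `jobs` is closed and empty, ending the loop
      while obj = jobs.receive?
        begin
          obj.init
          done.send(nil)
        rescue ex
          done.send(ex) # always report back, even on failure
        end
      end
    end
  end

  # Feed the jobs from their own fiber so this fiber is free to collect results.
  spawn do
    objs.each { |obj| jobs.send(obj) }
    jobs.close
  end

  errors = [] of Exception
  objs.size.times do
    if ex = done.receive
      errors << ex
    end
  end
  errors
end

objs = Array.new(120) { |i| Obj.new(i) }
errors = init_all(objs, workers: 8)
puts "#{errors.size} of #{objs.size} inits failed"
```

Every job produces exactly one message on the results channel, success or failure, so the collecting loop always finishes after objs.size receives.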

This now varies between working perfectly, hanging infinitely on the last item with no cpu load, or throwing a “no results” DB error in a new spot that it wasn’t getting errors in (ever) prior.

(when it was working, I thought maybe it was all set, so I upped my CRYSTAL_WORKERS and it immediately started failing back in the original way)

The “no results” is fascinating, because in all cases of it, there is literally no way the DB itself has no results. In every one of these cases I have had it log the SQL it is using, and I can take that SQL, run it in DataGrip, and get back data with no issue. So it is something about connections getting dropped from overwhelming the pool, or something like that, and the goofiness that happens in the middle of it. And like I said, sometimes it works for many rounds and then fails. Nothing in the code is deliberately randomized, so any randomness would have to come from outside it (system load or something).

The last time this happened I just gave up and stopped using MT when using the DB.

But this time around, when it does actually work, it is so much faster that it is just too much of a kick in the gut to have to be single threaded for the DB parts. It literally cuts days off the run time (this, like last time I posted, is Monte Carlo code that runs a bunch of simulations - the DB part is just initializing the objects that will be in the actual simulation - the DB doesn’t get hit in the sim stuff, and that all does MT perfectly fine - it is the DB part that is somehow failing in what seem like nondeterministic ways).

I am going to see what else I can debug out of it to try and figure out what is happening.