Help solving IO::Error


#1

I’ve asked this question on gitter channel before, but never really got an answer. The issue I have is there are so many moving parts, I have no idea where to start…

The error I’m getting most frequent is: (42k errors reported to bugsnag)

IO::Error Closed stream 
    /usr/share/crystal/src/io.cr:128:5 in 'write_byte' <file>
    lib/pg/src/pq/connection.cr:321:7 in 'query' <file>
    lib/db/src/db/pool_statement.cr:29:30 in 'query' <file>
    lib/clear/src/clear/model/collection.cr:112:38 in 'site_for_host' <file>
    src/actions/mixins/site_helpers.cr:20:5 in 'perform_action' <file>
    lib/lucky/src/lucky/route_handler.cr:10:7 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call_next' <file>
    src/handlers/cdn_validator.cr:5:5 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in '???' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call_next' <file>
    lib/lucky/src/lucky/log_handler.cr:20:17 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in '???' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call' <file>
    /usr/share/crystal/src/http/server/request_processor.cr:39:11 in 'process' <file>
    /usr/share/crystal/src/fiber.cr:255:3 in '???' <file>
    ??? <file>

My app is Lucky app running in production with Crystal 0.27 and Lucky 0.12. It’s running on ElasticBeanstalk, and when one of the servers gets mostly 500 errors, EB just shuts that server down. My guess is that a DB connection goes stale, and then never recovers causing this error.

I should also mention that with my Lucky setup, I’m using clear for my DB as opposed to the built-in LuckyRecord. I don’t really know where to start debugging this to see where it’s coming from, but if anyone has any ideas, I’ll try and post whatever info I can.

Thanks


#2

The info you’ve provided says little more than that you’re somewhere trying to operate on a closed I/O stream, which the you don’t need us to tell you, as the error message says that quite clearly.

I suggest you take a look in file src/actions/mixins/site_helpers.cr line 20, column 5,
maybe post the relevant lines if you don’t see the problem. :slightly_smiling_face:


#3

I’ve stared at this issue for a few months now, and it’s not apparent to me what’s going on… Here’s the whole code

module SiteHelpers


  def self.site_for_host(request : HTTP::Request) : Site?
    domain = request.host.as(String)
    sld = domain.gsub(/^www\./, "").split(".").first.gsub(/-/, "")

    if site_alias = SiteAlias.query.where { host == domain }.first
      site_alias
    else
      Site.query.where { short_name == sld }.first
    end
  end

  private def current_site : Site
    current_site?.not_nil!
  end

  private def current_site? : Site?
    @_current_site ||= SiteHelpers.site_for_host(request)
  end
end

The queries here aren’t crazy. This is also connecting to a RDS instance of postgres using initial_pool_size=5&retry_attempts=2 for the pool options.


#4

Well, my experience with Postgres is next to none. And the error message mentions a method perform_action called in site_helpers.cr which I don’t see in your posted code, is there any macros in the code? Maybe take another look with:

crystal tool expand -c <macro_location> entrypointc.cr

#5

The perform_action method is called from inside Lucky. I think from this call.


#6

Are you sure that postgres is running? Can you verify postgres didn’t crash when running the program? Do postgres queries before this error work?

Edit: Maybe you can rescue the IO::Error and retry the query or reopen connection?


#7

Postgres is for sure running. The app runs fine for a few days, and then out of nowhere I just start getting this error. Terminating the instance and allowing EB to boot a new one on it’s own is the current solution.

I’m currently using the postgres retry_attempts for retying. I’m not sure how to detect dead threads in the pool and have the pool just kill them.

What I’m wondering is Clear doesn’t have built in pool like this shard. It just calls DB.open. Maybe Clear would need to add in pool options to handle this?


#8

DB.open, and in general crystal-db, have a built-in pool:

There’s a big chance this is a bug in that pool implementation. I would create an issue there (it seems that are some related issues, maybe yours falls under one of them).


#9


Last paragraph in the OP, first sentence. Might be related :/

Actually, you can try ysbaddaden’s fix here. Maybe that will help


#10

Thanks @girng and @asterite for that info. I’m gonna try that fix out and see if it helps at all. If it is a crystal-db issue, then that at least points me in some direction that I can follow. I’ll report back if I see this issue go away, or if I’m still getting it.


#11

It would be interesting to check if you are getting more and more connections over time…hmm…


#12

I made that change as suggested here. We let that sit for the day, and things actually got a little worse. We currently run 3 servers, and we will normally see 1 server tank from too many IO::Error. Last night we had all 3 tank with Broken Pipe errors. I guess it technically did fix the IO error since we stopped getting those, but it seemed to make things a bit less stable.

Since it’s on RDS I was able to just look at the connections over the past week, and they hold pretty steady around 185 connections. I don’t really see it climbing.


#13

At first blush maybe it’s losing connection with the server and not reconnecting…Wonder if it’s possible to test with a “raw” Postgres connection instead of using DB, which is pooled? You could also setup an ongoing tcpdump so you could try and catch when the problem first occurs…maybe :)


#14

That all sounds great! How do I do that? lol

Would this tcpdump go in the crystal app? Or is this something that lives elsewhere? My guess is I could setup some random endpoint that makes that raw connection, and then a cron that just pings that endpoint every few seconds or something…


#15

tcpdump on the postgres port, sorry I don’t know how off hand. Yeah endpoint would be good…


#16

@jwoertink. Please let me/us know if you find something that works, I don’t want to run into the same problem as you haha.


#17

Yeah for sure. So far nothing. I reverted the last change I made and it’s holding steady, just a few errors here and there. Luckily EB will auto shut down an instance hitting too many 500 errors, and then reboot a new one.


#18

It sounds like some connections are cut abruptly, with neither DB nor the server acknowledgement.

Is it possible for connections to stay idle? Like there are more connections than necessary, so some are always used, while some are rarely used? A switch could just decide to cut long lived connections, leading to an IO::Error when it’s eventually read/written to.

Is DB::Pool just reporting the IO::Error and trying to connect again? Or maybe the connection is pulled, you get the error and don’t remove it from the pool (in an ensure block)?


#19

I honestly have no idea what it’s doing. The code throwing the actual error is pasted above. I’m not doing anything with DB or the pool directly. It’s possible I’m using too many connections; however, I’m not specifying too many as it is. ?initial_pool_size=5&retry_attempts=2. I just have those options specified with my postgres url.

@ysbaddaden do you have any suggestions on how I can get some better data or insight in to this? Maybe I fork all the shards and add some code somewhere that would let me at least log something that could give you a hint as to what it’s doing?


#21

@jwoertink have you tried a new version of Clear? It seems like they introduced their own connection pool implementation, therefore your problem might be gone.