Help solving IO::Error

jwoertink · December 11, 2018, 5:25pm

I’ve asked this question on gitter channel before, but never really got an answer. The issue I have is there are so many moving parts, I have no idea where to start…

The error I’m getting most frequent is: (42k errors reported to bugsnag)

IO::Error Closed stream 
    /usr/share/crystal/src/io.cr:128:5 in 'write_byte' <file>
    lib/pg/src/pq/connection.cr:321:7 in 'query' <file>
    lib/db/src/db/pool_statement.cr:29:30 in 'query' <file>
    lib/clear/src/clear/model/collection.cr:112:38 in 'site_for_host' <file>
    src/actions/mixins/site_helpers.cr:20:5 in 'perform_action' <file>
    lib/lucky/src/lucky/route_handler.cr:10:7 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call_next' <file>
    src/handlers/cdn_validator.cr:5:5 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in '???' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call_next' <file>
    lib/lucky/src/lucky/log_handler.cr:20:17 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in '???' <file>
    /usr/share/crystal/src/http/server/handler.cr:24:7 in 'call' <file>
    /usr/share/crystal/src/http/server/request_processor.cr:39:11 in 'process' <file>
    /usr/share/crystal/src/fiber.cr:255:3 in '???' <file>
    ??? <file>

My app is Lucky app running in production with Crystal 0.27 and Lucky 0.12. It’s running on ElasticBeanstalk, and when one of the servers gets mostly 500 errors, EB just shuts that server down. My guess is that a DB connection goes stale, and then never recovers causing this error.

I should also mention that with my Lucky setup, I’m using clear for my DB as opposed to the built-in LuckyRecord. I don’t really know where to start debugging this to see where it’s coming from, but if anyone has any ideas, I’ll try and post whatever info I can.

Thanks

Jenz · December 11, 2018, 6:59pm

The info you’ve provided says little more than that you’re somewhere trying to operate on a closed I/O stream, which the you don’t need us to tell you, as the error message says that quite clearly.

I suggest you take a look in file src/actions/mixins/site_helpers.cr line 20, column 5,
maybe post the relevant lines if you don’t see the problem.

jwoertink · December 11, 2018, 7:36pm

I’ve stared at this issue for a few months now, and it’s not apparent to me what’s going on… Here’s the whole code

module SiteHelpers


  def self.site_for_host(request : HTTP::Request) : Site?
    domain = request.host.as(String)
    sld = domain.gsub(/^www\./, "").split(".").first.gsub(/-/, "")

    if site_alias = SiteAlias.query.where { host == domain }.first
      site_alias
    else
      Site.query.where { short_name == sld }.first
    end
  end

  private def current_site : Site
    current_site?.not_nil!
  end

  private def current_site? : Site?
    @_current_site ||= SiteHelpers.site_for_host(request)
  end
end

The queries here aren’t crazy. This is also connecting to a RDS instance of postgres using initial_pool_size=5&retry_attempts=2 for the pool options.

Jenz · December 11, 2018, 8:35pm

Well, my experience with Postgres is next to none. And the error message mentions a method perform_action called in site_helpers.cr which I don’t see in your posted code, is there any macros in the code? Maybe take another look with:

crystal tool expand -c <macro_location> entrypointc.cr

jwoertink · December 11, 2018, 8:54pm

The perform_action method is called from inside Lucky. I think from this call.

sol.vin · December 11, 2018, 8:57pm

Are you sure that postgres is running? Can you verify postgres didn’t crash when running the program? Do postgres queries before this error work?

Edit: Maybe you can rescue the IO::Error and retry the query or reopen connection?

jwoertink · December 11, 2018, 9:17pm

Postgres is for sure running. The app runs fine for a few days, and then out of nowhere I just start getting this error. Terminating the instance and allowing EB to boot a new one on it’s own is the current solution.

I’m currently using the postgres retry_attempts for retying. I’m not sure how to detect dead threads in the pool and have the pool just kill them.

What I’m wondering is Clear doesn’t have built in pool like this shard. It just calls DB.open. Maybe Clear would need to add in pool options to handle this?

asterite · December 11, 2018, 11:47pm

DB.open, and in general crystal-db, have a built-in pool:

github.com

crystal-lang/crystal-db/blob/master/src/db/pool.cr

require "weak_ref"

module DB
  class Pool(T)
    @initial_pool_size : Int32
    # maximum amount of objects in the pool. Either available or in use.
    @max_pool_size : Int32
    @available = Set(T).new
    @total = [] of T
    @checkout_timeout : Float64
    # maximum amount of retry attempts to reconnect to the db. See `Pool#retry`.
    @retry_attempts : Int32
    @retry_delay : Float64

    def initialize(@initial_pool_size = 1, @max_pool_size = 0, @max_idle_pool_size = 1, @checkout_timeout = 5.0,
                   @retry_attempts = 1, @retry_delay = 0.2, &@factory : -> T)
      @initial_pool_size.times { build_resource }

      @availability_channel = Channel(Nil).new
      @waiting_resource = 0

This file has been truncated. show original

There’s a big chance this is a bug in that pool implementation. I would create an issue there (it seems that are some related issues, maybe yours falls under one of them).

girng · December 12, 2018, 2:07am

Last paragraph in the OP, first sentence. Might be related :/

Actually, you can try ysbaddaden’s fix here. Maybe that will help

jwoertink · December 12, 2018, 4:33pm

Thanks @girng and @asterite for that info. I’m gonna try that fix out and see if it helps at all. If it is a crystal-db issue, then that at least points me in some direction that I can follow. I’ll report back if I see this issue go away, or if I’m still getting it.

rogerdpack · December 13, 2018, 5:31am

It would be interesting to check if you are getting more and more connections over time…hmm…

jwoertink · December 13, 2018, 5:20pm

I made that change as suggested here. We let that sit for the day, and things actually got a little worse. We currently run 3 servers, and we will normally see 1 server tank from too many IO::Error. Last night we had all 3 tank with Broken Pipe errors. I guess it technically did fix the IO error since we stopped getting those, but it seemed to make things a bit less stable.

Since it’s on RDS I was able to just look at the connections over the past week, and they hold pretty steady around 185 connections. I don’t really see it climbing.

rogerdpack · December 13, 2018, 5:43pm

At first blush maybe it’s losing connection with the server and not reconnecting…Wonder if it’s possible to test with a “raw” Postgres connection instead of using DB, which is pooled? You could also setup an ongoing tcpdump so you could try and catch when the problem first occurs…maybe :)

jwoertink · December 13, 2018, 6:03pm

That all sounds great! How do I do that? lol

Would this tcpdump go in the crystal app? Or is this something that lives elsewhere? My guess is I could setup some random endpoint that makes that raw connection, and then a cron that just pings that endpoint every few seconds or something…

rogerdpack · December 13, 2018, 8:01pm

tcpdump on the postgres port, sorry I don’t know how off hand. Yeah endpoint would be good…

girng · December 14, 2018, 6:50pm

@jwoertink. Please let me/us know if you find something that works, I don’t want to run into the same problem as you haha.

jwoertink · December 14, 2018, 7:37pm

Yeah for sure. So far nothing. I reverted the last change I made and it’s holding steady, just a few errors here and there. Luckily EB will auto shut down an instance hitting too many 500 errors, and then reboot a new one.

ysbaddaden · December 16, 2018, 9:32am

It sounds like some connections are cut abruptly, with neither DB nor the server acknowledgement.

Is it possible for connections to stay idle? Like there are more connections than necessary, so some are always used, while some are rarely used? A switch could just decide to cut long lived connections, leading to an IO::Error when it’s eventually read/written to.

Is DB::Pool just reporting the IO::Error and trying to connect again? Or maybe the connection is pulled, you get the error and don’t remove it from the pool (in an ensure block)?

jwoertink · December 16, 2018, 9:07pm

I honestly have no idea what it’s doing. The code throwing the actual error is pasted above. I’m not doing anything with DB or the pool directly. It’s possible I’m using too many connections; however, I’m not specifying too many as it is. ?initial_pool_size=5&retry_attempts=2. I just have those options specified with my postgres url.

@ysbaddaden do you have any suggestions on how I can get some better data or insight in to this? Maybe I fork all the shards and add some code somewhere that would let me at least log something that could give you a hint as to what it’s doing?

sdzyba · December 28, 2018, 2:24pm

@jwoertink have you tried a new version of Clear? It seems like they introduced their own connection pool implementation, therefore your problem might be gone.

Topic		Replies	Views
New app, new errors! Help & Support	10	1058	February 19, 2020
Unhandled exception: Error writing file: Broken pipe (Errno) Help & Support	8	1374	July 31, 2022
Unable to get info: Bad file descriptor (IO::Error) Help & Support	15	371	May 7, 2024
Debugging segfaults further Help & Support	2	330	May 25, 2019
Handling the new dependency resolver Help & Support	9	670	July 1, 2020

Help solving IO::Error

Related topics