New app, new errors!

Hey all, so I just got my latest Lucky app up in production, and the app has to handle a ton of traffic.

It’s been a little shaky today, which is strange because every other app we’ve launched has been rock solid. I saw these errors come through, and I need some help deciphering what they mean and how I can go about trying to fix them.

This one:

Errno Error reading socket: Bad file descriptor 
    /usr/share/crystal/src/gc/boehm.cr:91:5 in 'unbuffered_read'
    /usr/share/crystal/src/io/buffered.cr:79:16 in 'read'
    BIO_read
    ???
    ???
    ???
    /usr/share/crystal/src/openssl/ssl/socket.cr:116:5 in 'fill_buffer'
    /usr/share/crystal/src/indexable.cr:403:3 in 'read_char'
    lib/pg/src/pq/connection.cr:157:12 in 'expect_frame'
    lib/pg/src/pg/statement.cr:19:5 in 'perform_query'
    lib/db/src/db/statement.cr:88:14 in 'query:args'
    lib/db/src/db/query_methods.cr:61:12 in 'calculate_assets'
    lib/disc_jockey/src/disc_jockey/burns/decorator.cr:19:7 in 'new'
    /usr/share/crystal/src/enumerable.cr:447:5 in 'perform_action'
    lib/lucky/src/lucky/route_handler.cr:0:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    lib/lucky/src/lucky/context_extensions.cr:13:5 in 'call'
    /usr/share/crystal/src/http/server/request_processor.cr:48:11 in 'process'
    /usr/share/crystal/src/fiber.cr:255:3 in 'run'
    ???

and this one:

Errno Error writing to socket: Broken pipe 
    /usr/share/crystal/src/socket.cr:0:13 in 'unbuffered_write'
    /usr/share/crystal/src/http/server/response.cr:217:11 in 'unbuffered_write'
    lib/lucky/src/lucky/renderable.cr:98:5 in 'perform_action'
    lib/lucky/src/lucky/route_handler.cr:0:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    /usr/share/crystal/src/http/server/handler.cr:26:7 in 'call'
    lib/lucky/src/lucky/context_extensions.cr:13:5 in 'call'
    /usr/share/crystal/src/http/server/request_processor.cr:48:11 in 'process'
    /usr/share/crystal/src/fiber.cr:255:3 in 'run'
    ???

I’m running on Docker using Crystal 0.31.1 with Lucky 0.18.1.

The first one looks like it might be a bug in the crystal-pg shard: “Bad file descriptor” usually means the wrong fd is being passed to a socket function. Are you running the same version of that shard on this app as on your other apps? If not, it might be worth downgrading to the version you’re already running in prod to see if the same issue crops up.

Your second error can happen whenever the network drops out from under a connection, even something as mundane as a client’s wifi changing. It’s similar to “No route to host” on a ping, except it shows up after the TCP connection was already established: the client went away, and your server then tried to write to a socket whose other end was gone. A client closing the connection mid-response can trigger it too, though that often appears as “Connection reset by peer” rather than “Broken pipe”. Either way, it’s a pretty common network thing and not caused by your app. :-)
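
If those disconnect errors are just noise in your error reporting, you could swallow them with a small middleware. This is only a rough sketch against Crystal 0.31.x, where they surface as Errno (the handler name and log line are made up), and you’d slot it into your middleware stack ahead of Lucky’s route handler:

require "http/server/handler"

# Hypothetical middleware: ignore writes to sockets the client already closed
# (broken pipe / connection reset) instead of letting them bubble up as errors.
class ClientDisconnectHandler
  include HTTP::Handler

  def call(context)
    call_next(context)
  rescue ex : Errno
    # On Crystal 0.31.x client disconnects surface as Errno; re-raise anything else.
    raise ex unless ex.message.to_s =~ /Broken pipe|Connection reset/i
    puts "client went away mid-response: #{ex.message}"
  end
end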

1 Like

Ah, interesting. I just noticed that the other apps are actually using a later version. I’ll upgrade and see if that fixes the first one.

The apps are on Elastic Beanstalk running Docker behind CloudFront. There are lots of different layers, so I guess that makes sense. I just wish there were a better way to handle that so we’re not just dropping traffic randomly.

Thanks for the help!

Well, I tried pinning this app to the same shard versions as the other stable ones, and it went down again in the middle of the night. It actually gets less traffic than the other app and uses more efficient SQL queries. Looking at all the logs and graphs, it seems to be something in Crystal itself. The interesting thing is that the older app is running Crystal 0.29.0 and this one is running 0.31.1. I guess I could try upgrading, but downgrading to match the older app might be a bit difficult to do.

Alright, after some serious debugging, we have a somewhat repeatable way to tank our app. It doesn’t seem to take much effort either; for example, if someone is scraping our site, we get this:

GC Warning: Failed to expand heap by 607793152 bytes
GC Warning: Failed to expand heap by 574238720 bytes
GC Warning: Out of Memory! Heap size: 1174 MiB. Returning NULL!
Invalid memory access (signal 11) at address 0x0
[0x6a2fd6] ???
[0x451c1f] __crystal_sigfault_handler +479
[0x7f8f8abf4390] ???
[0x7f8f8a123a9e] ???
[0x4ec16a] ???
[0x4fd5bb] ???
[0x4fc4be] ???
[0x512750] ???
[0x588e0d] ???
[0x5cb50c] ???
[0x62b3ac] ???
[0x66c1fa] ???
[0x546b1f] ???
[0x53182c] ???
[0x5312c1] ???
[0x52691f] ???
[0x526431] ???
[0x52577e] ???
[0x52591b] ???
[0x68e480] ???
[0x4d2a1a] ???
[0x44d29b] ???
[0x0] ???

Then, after enough rounds of that, we get this fun one (tailing the Docker logs):

*** Error in `/app/app': corrupted double-linked list: 0x0000000001cf1030 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f7af95b27e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7d814)[0x7f7af95b8814]
/lib/x86_64-linux-gnu/libc.so.6(+0x8288e)[0x7f7af95bd88e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f7af95bf184]
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0(CRYPTO_malloc+0x58)[0x7f7afa965e78]
/lib/x86_64-linux-gnu/libssl.so.1.0.0(+0x28a05)[0x7f7afad70a05]
/lib/x86_64-linux-gnu/libssl.so.1.0.0(+0x28b59)[0x7f7afad70b59]
/lib/x86_64-linux-gnu/libssl.so.1.0.0(+0x2ab6c)[0x7f7afad72b6c]
/app/app[0x4e3b62]
/app/app[0x4e1b8f]
....tons more here...
/app/app[0x44c61b]
======= Memory map: ========
00400000-00753000 r-xp 00000000 fd:04 2888380                            /app/app
00952000-00953000 r--p 00352000 fd:04 2888380                            /app/app
00953000-00955000 rw-p 00353000 fd:04 2888380                            /app/app
00955000-00bab000 rw-p 00000000 00:00 0 
01b01000-02937000 rw-p 00000000 00:00 0                                  [heap]
7f7a88000000-7f7a88021000 rw-p 00000000 00:00 0 
7f7a88021000-7f7a8c000000 ---p 00000000 00:00 0 
.....and this goes on forever....

This also seems similar to what’s reported here: https://github.com/omarroth/invidious/issues/773

If anyone has any clue as to which direction I should look in, that would be greatly appreciated.

An out-of-memory error like that suggests either a memory leak somewhere or more concurrent volume than the box has memory for. You’re using over 1 GB of RAM, so unless you’re serving a lot of massive HTTP responses concurrently (do you have any metrics on concurrent requests and response sizes?), there may be a leak somewhere.

In the meantime, it might be worth running on beefier hardware so you can see your memory usage climb in your metrics for a bit longer before it ‘splodes. If uptime for this app is paramount, you could also reboot it every X minutes or so to keep it going.
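
One cheap way to see that climb without extra infrastructure: have the app log the GC’s own heap stats every so often. A minimal sketch, assuming the default Boehm GC; the interval and log format are arbitrary, and you’d drop it somewhere in your app’s startup code:

# Print heap stats periodically so a slow leak shows up in the logs long
# before the "Failed to expand heap" warnings start.
spawn do
  loop do
    stats = GC.stats
    puts "gc heap_size=#{stats.heap_size} free_bytes=#{stats.free_bytes} bytes_since_gc=#{stats.bytes_since_gc}"
    sleep 30.seconds
  end
end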

Yeah, the app is hosted on 2 t2.medium EC2 instances. We thought about bumping that up, but all the metrics AWS shows us say the CPU sits around 2% usage and memory stays super low, UNTIL someone started scraping our site and hammering one specific action. Once we banned that user, the app has been purring along all buttery smooth. However, that’s not really a fix.

We replicated the environment on staging so we could test it. We created a little Ruby mechanize script that sends 160 concurrent requests to that page (a rough Crystal equivalent is sketched after the lsof output below), and when we watched the app process’s open file descriptors we noticed this:

# Before the action
lsof | grep "7 root" | wc -l
21

# After the action
lsof | grep "7 root" | wc -l
333
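
In case it helps anyone reproduce something similar without Ruby, this is roughly what that script does, sketched in Crystal (the URL and the concurrency of 160 are placeholders for our real action):

require "http/client"
require "uri"

# Hypothetical target; swap in the action that was getting scraped.
url = URI.parse("https://staging.example.com/the-heavy-action")
concurrency = 160
done = Channel(Nil).new

concurrency.times do
  spawn do
    begin
      HTTP::Client.get(url)
    rescue ex
      puts "request failed: #{ex.message}"
    ensure
      done.send(nil)
    end
  end
end

# Block until every request has finished (or failed).
concurrency.times { done.receive }
puts "fired #{concurrency} concurrent requests"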

We updated one of our HTTP handlers to count file descriptors, and the count was getting out of hand, which led us to believe there are sockets leaking somewhere… we’re just not sure where, or how to really lock in that answer.

# Count open file descriptors for this process by shelling out to lsof.
# (lsof prints a header line, so subtract it from the line count.)
open_file_descriptors = `lsof -p #{Process.pid}`
open_file_descriptor_count = open_file_descriptors.lines.size - 1
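
On Linux you can also get that count without shelling out at all by reading /proc. Rough sketch, assuming /proc is mounted (it is inside our Docker containers):

# Each entry in /proc/<pid>/fd is one open file descriptor, sockets included.
# The count is off by one because it includes the descriptor used to read the
# directory itself, which doesn't matter when you're watching for a leak.
def open_fd_count : Int32
  Dir.children("/proc/#{Process.pid}/fd").size
end

puts "open fds: #{open_fd_count}"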

Maybe you could reduce the app to that particular route and slowly remove code until you end up with something you can share, so we can reproduce it and try to narrow down the error?
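
Something like this could be the end state of that stripping-down: a bare HTTP::Server plus crystal-db with no Lucky in between, so you can see whether the fd count still grows. The connection URL and query here are placeholders:

require "http/server"
require "db"
require "pg"

# Hypothetical: point this at the same database and run the same query
# as the action that gets hammered.
db = DB.open(ENV["DATABASE_URL"])

server = HTTP::Server.new do |context|
  value = db.scalar("SELECT 1")
  context.response.content_type = "text/plain"
  context.response.print value.to_s
end

puts "Listening on http://0.0.0.0:8080"
server.bind_tcp "0.0.0.0", 8080
server.listen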

It looks like file descriptors are leaking, so it must be something that’s been overlooked. Once that’s found and fixed, the app should be working fine again.

3 Likes

That’s a good idea. We’re gonna give that a shot. Thanks! I’ll report back if we find something.

Ok, some good news to report on this. We upgraded to Crystal 0.33.0, let it sit overnight, and it has held steady. We ran our custom load testing, and we can bump it up 5x and still get no crash… So it seems that somewhere between Crystal 0.29 and 0.33 something broke, but then got fixed again?

So for now things are looking good, but if anyone has any insight into what could have caused sockets to leak somewhere in those commits, it might be good to throw some specs in or something (if there aren’t some already).

It’s only been a day, so I’ll report back if this changes, but initial look is it’s holding great!

5 Likes

Great news!

You could run a git bisect… though I guess it’ll take a few days to find the commit :-)

1 Like