Hi,
Please help me figure out the reason for some strange behavior in one of my apps: it has recently developed a memory leak. As the investigation showed, this has actually been going on for a long time, but after a crash systemd restarts the service so quickly that monitoring never has time to register the crash and notify me. The leak looks as if someone had limited the process's memory, but that is not the case: about 2 GB of RAM is still free. The application log shows only a generic message without any additional details:
[...]
Unhandled exception in spawn: Error while flushing data to the client (HTTP::Server::ClientEr
from ???
Caused by: Error writing to socket: Broken pipe (IO::Error)
from ???
Unhandled exception in spawn: Error while flushing data to the client (HTTP::Server::ClientEr
[...]
System logs contain a bit more information:
Aug 29 11:50:03 host-02 hermes[3673]: mmap(PROT_NONE) failed
Aug 29 11:50:36 host-02 systemd[1]: dispatcher.service: Main process exited, code=dumped, status=6/ABRT
Aug 29 11:50:36 host-02 systemd[1]: dispatcher.service: Failed with result 'core-dump'.
Aug 29 11:50:41 host-02 systemd[1]: dispatcher.service: Service hold-off time over, scheduling restart.
Aug 29 11:50:41 host-02 systemd[1]: dispatcher.service: Scheduled restart job, restart counter is at 1.
Aug 29 11:50:41 host-02 systemd[1]: Stopped dispatcher daemon.
Aug 29 11:50:41 host-02 systemd[1]: Started dispatcher daemon.
Aug 29 15:22:12 host-02 dispatcher[9226]: GC Warning: Repeated allocation of very large block (appr. size 2625536):
Aug 29 15:22:12 host-02 dispatcher[9226]: May lead to memory leak and poor performance
Aug 29 17:18:29 host-02 dispatcher[9226]: GC Warning: Repeated allocation of very large block (appr. size 1052672):
Aug 29 17:18:29 host-02 dispatcher[9226]: May lead to memory leak and poor performance
Aug 29 18:00:05 host-02 dispatcher[9226]: mmap(PROT_NONE) failed
Aug 29 18:01:10 host-02 systemd[1]: dispatcher.service: Main process exited, code=dumped, status=6/ABRT
Aug 29 18:01:10 host-02 systemd[1]: dispatcher.service: Failed with result 'core-dump'.
Aug 29 18:01:15 host-02 systemd[1]: dispatcher.service: Service hold-off time over, scheduling restart.
Aug 29 18:01:15 host-02 systemd[1]: dispatcher.service: Scheduled restart job, restart counter is at 2.
Aug 29 18:01:15 host-02 systemd[1]: Stopped dispatcher daemon.
Aug 29 18:01:15 host-02 systemd[1]: Started dispatcher daemon.
The mmap line caught my attention, so I went to look at the smaps information for the process (/proc/<pid>/smaps). It shows about 10K similar entries:
7f0f7a98e000-7f0f7a98f000 ---p 00000000 00:00 0
Size: 4 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
VmFlags: mr mw me ac sd nh
And their number was constantly increasing:
Sun, 30 Aug 2020 23:25:01 +0000 | 23702
Sun, 30 Aug 2020 23:30:02 +0000 | 24020
Sun, 30 Aug 2020 23:35:01 +0000 | 24318
Sun, 30 Aug 2020 23:40:01 +0000 | 24552
Sun, 30 Aug 2020 23:45:01 +0000 | 24804
Sun, 30 Aug 2020 23:50:01 +0000 | 25068
Sun, 30 Aug 2020 23:55:01 +0000 | 26404
Mon, 31 Aug 2020 00:00:02 +0000 | 28360
Mon, 31 Aug 2020 00:05:01 +0000 | 31105
Mon, 31 Aug 2020 00:10:01 +0000 | 33323
Mon, 31 Aug 2020 00:15:01 +0000 | 34014
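The counts above come from periodically counting the inaccessible ("---p") mappings of the process. Roughly, the counting is equivalent to this simplified sketch (the pid argument and the plain "---p" match are just my simplification, not the exact script):

# Simplified sketch of the counting: read /proc/<pid>/maps and count
# the lines whose permission field is "---p", printing a timestamp with them.
pid = ARGV[0]? || "self"
count = File.read_lines("/proc/#{pid}/maps").count(&.includes?(" ---p "))
puts "#{Time.utc} | #{count}"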
I left that counter attached for longer and found that the number of such mappings reaches a maximum of 65536, after which the app crashes. 65536 happens to be the same as the maximum port number, so maybe it is related to the number of connections: the app keeps permanent connections to three remote servers and handles incoming HTTP connections. But if the app were running out of connections, I should see far too many open sockets, right? No: netstat, ss, lsof and other utilities all show roughly the same count of about 100-150 connections, which looks nothing like 65536.
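For the socket side I relied on ss/lsof, but the same number can be cross-checked per process by listing its descriptors; a rough sketch, assuming the usual /proc layout where socket descriptors resolve to "socket:[<inode>]" links:

# Rough per-process cross-check of the socket count: every socket descriptor
# appears in /proc/<pid>/fd as a symlink whose target starts with "socket:".
pid = ARGV[0]? || "self"
sockets = Dir.children("/proc/#{pid}/fd").count do |fd|
  begin
    File.readlink("/proc/#{pid}/fd/#{fd}").starts_with?("socket:")
  rescue
    false # the descriptor may disappear while we iterate
  end
end
puts "open sockets: #{sockets}"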
So it turns out the application exhausts the maximum number of memory mappings, yet it does not actually consume that memory (Rss stays at zero for these entries). I have only been programming for about 4 years (and only in my free time at that), so there is a lot I don't know, and in this case I simply don't understand where else to look for the cause. Just in case, I ran strace (with per-syscall stack traces) to see what was happening inside the application, and noticed some interesting points:
strace: ptrace_setoptions = 0x51
strace: attach to pid 23439 (main) succeeded
strace: Process 23439 attached
strace: [wait(0x80057f) = 23439] WIFSTOPPED,sig=SIGTRAP,EVENT_STOP (128)
strace: pid 23439 has TCB_STARTUP, initializing it
strace: [wait(0x00857f) = 23439] WIFSTOPPED,sig=133
[00007f0f90207d67] epoll_wait(9<anon_inode:[eventpoll]>, strace: [wait(0x00857f) = 23439] WIFSTOPPED,sig=133
[], 32, 214) = 0
strace: build_mmap_cache: tgen=0, ggen=0, tcp=0x557bf4e453d0, cache=0x557bf4e52940
strace: unwind_print_stacktrace: walk: tcp=0x557bf4e453d0, queue=(nil)
> /lib/x86_64-linux-gnu/libc-2.27.so(epoll_wait+0x57) [0x121d67]
> /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6.0.2(event_strlcpy_+0x12c5) [0x292c5]
> /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6.0.2(event_base_loop+0x314) [0x1f114]
> /var/www/dispatcher/releases/20200723024901/dispatcher(*Crystal::Scheduler#reschedule:Nil+0x3d) [0x1b46d]
> /var/www/dispatcher/releases/20200723024901/dispatcher(*TCPSocket+@Socket#unbuffered_read<Slice(UInt8)>:Int32+0x1d6) [0x8c726]
> /var/www/dispatcher/releases/20200723024901/dispatcher(*TCPSocket+@IO::Buffered#read<Slice(UInt8)>:Int32+0xe6) [0x8c496]
> /var/www/dispatcher/releases/20200723024901/dispatcher(~procProc(Nil)@lib/tarantool/src/tarantool/connection.cr:94+0x1ef) [0x9261f]
> /var/www/dispatcher/releases/20200723024901/dispatcher(*Fiber#run:(IO::FileDescriptor | Nil)+0x2c) [0x35bac]
strace: [wait(0x00857f) = 23439] WIFSTOPPED,sig=133
[00007f0f90d5d9be] sendto(8<socket:[544960]>, "\xce\x00\x00\x00\x08\x82\x00\x40\x01\xcd\x31\xd2\xc0", 13, 0, NULL, 0strace: [wait(0x00857f) = 23439] WIFSTOPPED,sig=133
) = 13
strace: unwind_print_stacktrace: walk: tcp=0x557bf4e453d0, queue=(nil)
> /lib/x86_64-linux-gnu/libpthread-2.27.so(send+0x5e) [0x119be]
> /var/www/dispatcher/releases/20200723024901/dispatcher(*TCPSocket+@Socket#send<Slice(UInt8)>:Int32+0x1c) [0xa5a2c]
> /var/www/dispatcher/releases/20200723024901/dispatcher(*Tarantool::Connection@Tarantool::Connection::Requests#ping:Time::Span+0x462) [0xa2222]
> /var/www/dispatcher/releases/20200723024901/dispatcher(~procProc(Nil)@lib/tarantool/src/tarantool/connection.cr:106+0x48) [0xa1ca8]
> /var/www/dispatcher/releases/20200723024901/dispatcher(*Fiber#run:(IO::FileDescriptor | Nil)+0x2c) [0x35bac]
strace: [wait(0x00857f) = 23439] WIFSTOPPED,sig=133
[00007f0f90207d67] epoll_wait(9<anon_inode:[eventpoll]>, strace: [wait(0x00857f) = 23439] WIFSTOPPED,sig=133
[{EPOLLIN, {u32=8, u64=8}}], 32, 66) = 1
Here you can see that only three descriptors are constantly in use, which is consistent with the small number of open sockets. The code at this path (I run a slightly modified version privately, but not in this part) has been the same for more than a year and apparently worked fine until now: https://github.com/vladfaust/tarantool.cr/blob/master/src/tarantool/connection.cr#L90.
This led me to the hypothesis that the spawned fibers do not die after transferring data to the remote server and keep occupying space in memory. The next thing I did was check the fiber implementation itself for correctness: I found no problems there, it works as expected. My own code does not interfere with this logic in any way either; it simply passes data over the connection object.
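If that hypothesis is right, the picture should be reproducible with nothing but spawn: as far as I understand the runtime, every fiber gets its own stack, and I assume one page of it is protected with PROT_NONE as a guard page, so fibers that never finish would keep exactly this kind of small inaccessible mapping alive. A minimal, artificial sketch (not the dispatcher code):

# Artificial reproduction of the hypothesis: spawn fibers that never finish
# and watch /proc/<pid>/smaps fill up with small inaccessible mappings.
channel = Channel(Nil).new

10_000.times do
  spawn do
    channel.receive # never completes, so the fiber and its stack stay alive
  end
end

Fiber.yield # let the fibers start and block
puts "inspect /proc/#{Process.pid}/smaps while this sleeps"
sleep # keep the process alive

Running something like this and watching smaps should show whether the count grows the same way it does in the dispatcher.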
What could be causing this behavior?