There isn't a bug, it's resource exhaustion. You open a bunch of files and they fail to close. You don't log errors on the close, so you have no idea it's happening. Now your app is failing to open new file descriptors to accept HTTP connections. You get a fixed number of fds per app; ulimit -n. If you don't close files you've read, the descriptor is gone.
The bug in this case is in the filesystem that hangs on close. It happens on network filesystems. You can't return the fd to the kernel if your filesystem doesn't let you.
The bug of which we speak is in that your app is crashing. Exhausting open file handles is expected behaviour! Expected behaviour should not lead to a crash. Crashing is only for exceptional behaviour.
The filesystem hanging is unlikely to be a bug. The filesystems you'd realistically use in conjunction with Kubernetes are pretty heavily tested. More likely it is supposed to hang under whatever conditions has lead that to happen.
And, sure, maybe you'll eventually want to determine why the filesystem has moved into that failure state, but most pressing is that your app is crashing. All that work you put into gracefully handling the failing situation going to waste.
Kubernetes is really here nor there. It's the crashing of the app that is our focus. An app should not be crashing on expected behaviour.
That's clearly a bug, and the bug you need to fix first so that you can have your failsafes start working again. You asked where to start and that's the answer, unquestionably.
The app doesn't crash, it's deadlocked. It can't do any more work because to do future work it needs to accept TCP connections. It can't do that because it has hit a resource limit. It hit the resource limit because it didn't correctly close files. It can't close files because of a bug in the filesystem. You don't know this because you didn't log the errors.
I really don't know how I can make my explanation simpler.
> I really don't know how I can make my explanation simpler.
Not making up some elaborate story that you now are trying to say didn't even happen would be a good start. What you are actually trying to communicate is not complicated at all. It didn't need a story. Not sure what you were thinking when you decided fiction writing was a good idea, but I certainly had fun making fun of you for it! So, at least it was not all for not.
The bug in this case is in the filesystem that hangs on close. It happens on network filesystems. You can't return the fd to the kernel if your filesystem doesn't let you.