Recently I experienced a few power failures that lasted hours. This means that when the power is back, all of my infrastructure reboots and reconnects. For the most part this is 100% automatic, but the last time I ran into an interesting problem.
My pi-hole was running with the default rate limiting of 1000/60. This means that each device can make up to 1000 requests per minute, and if it exceeds that it will be put on a deny list for 60 seconds.
It turns out that my main server that runs a bunch of docker containers makes a lot of DNS requests when everything is starting up all at once. This creates a storm of requests to the pi-hole and the server ends up being blocked for DNS requests (responding with
REFUSED) due to rate limiting.
Unfortunately the behaviour of enough of the containers is to retry when this happens. This causes more DNS requests to be made as the retry logic runs. These retries cause another wave of requests which cause the server to be blocked again. Some of my containers entered error conditions due to unexpected DNS failures, so these needed to later be restarted but at least they stopped contributing to the problem.
My email container was pretty unhappy, it really wants to be able to use DNS, even when receiving email. Since my server had been unavailable for a while, there were external email servers trying to deliver mail that had been queued – this contributed to the load. Additionally I couldn’t connect any email clients to the server which left me scratching my head a little, more on that later on.
The ‘fix’ was easy enough. Modify the pi-hole DNS rate-limiting setting to 0/0 to remove any rate limiting. This is imperfect, but at one point I saw 30,000 requests in a minute from my struggling server and I think I’d rather have no limit and deal with that problem than hit the limit and run into this denial of service issue.
Now that the pi-hole was happy, I was able to get most of my containers to be happy with a little poking at them. Email was still sad, and this took me a coffee break to realize what was wrong. The email container was receiving email just fine, but I could not connect with a client. This felt like a networking problem, but how could that be?
I had forgotten (again) – that the email server has fail2ban running in it. This scans logs looking for suspicious activity and will ban an IP for a period of time by inserting a firewall rule. Furthermore, as I use the domain name to configure my email client – this resolves to the external IP. The external IP means that the client talks to my OpenWRT router which provides NAT and then redirects/maps that external IP back into my network. This has the effect that the originating IP looks like it is my router, not the client machine on the internal IP address. This process is called NAT reflection, or NAT hairpinning.
While NAT reflection is a super handy feature for my OpenWRT router to have, allowing me to easily from inside my home network visit a machine I’ve exposed via port mapping to the outside world using the same DNS entry that points at the external IP address — it means that services on that machine see my router IP as the client IP. When any of the machines in my house have problems connecting to my email server, in this case because I had DNS
REFUSED errors on the email server, fail2ban decides that is a bad client and bans it. Thus banning all traffic originating from my home network.
This is easy to fix once you understand what is happening, I just needed to unban my router IP and my email clients could connect.