4 min read
K8s DNS fail

Very early career, judge me less for this one.

Published: Jan 2023. This was my running notes on debugging intermittent DNS failures in Kubernetes. I later learned the main root causes often relate to conntrack exhaustion, kube-proxy subtleties, or musl libc DNS resolver quirks. I keep this here as an exploration record.

Let the service name be ‘banana’

Let the client pods (the ones doing the lookup) belong to the service ‘pineapple’

Incident #1

There was a rise in 502s accompanied by cannot lookup banana. Interestingly, only one of the pods was unable to perform this lookup, and the error resolved on its own after some time, overall only causing a blip.

No other panels were found that correlate with this, just the rise in logging. Also, the IP address in the logs did not change at all - it belongs to one of the two pods, so it seems only this one pod faced a DNS resolution issue for the banana service.

Need to go through the following steps next,

https://aws.amazon.com/premiumsupport/knowledge-center/eks-dns-failure/

Incident #2

Exact same signature as Incident #1.

And again, exact same pod

Incident #3 and #4

Exact same all over again. In total there were thousands of error lines for a pod being unable to resolve DNS for the banana k8s service.

But if we try to find unique lines across them, there are only 2 unique lines, corresponding to the two different times at which the blip occurred.

time=2023-01-10 20:45:46 level=error msg=http: proxy error: dial tcp: lookup svc.ns.svc.cluster.local on 172.20.0.10:53: read udp 10.50.133.117:60450->172.20.0.10:53: i/o timeout

time=2023-01-11 02:13:09 level=error msg=http: proxy error: dial tcp: lookup svc.ns.svc.cluster.local on 172.20.0.10:53: read udp 10.50.130.253:47882->172.20.0.10:53: i/o timeout
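The two lines differ only in timestamp and the ephemeral source ip:port; stripping those shows they are a single error signature. A quick sketch of the de-duplication (the log file path here is made up for the demo):

```shell
# Write the two surviving error lines to a throwaway file.
cat <<'EOF' > /tmp/dns_errors.log
time=2023-01-10 20:45:46 level=error msg=http: proxy error: dial tcp: lookup svc.ns.svc.cluster.local on 172.20.0.10:53: read udp 10.50.133.117:60450->172.20.0.10:53: i/o timeout
time=2023-01-11 02:13:09 level=error msg=http: proxy error: dial tcp: lookup svc.ns.svc.cluster.local on 172.20.0.10:53: read udp 10.50.130.253:47882->172.20.0.10:53: i/o timeout
EOF

# Strip the timestamp and the ephemeral source ip:port, then count
# distinct signatures - both lines collapse to one.
unique=$(sed -E 's/^time=[0-9: -]+ //; s/[0-9.]+:[0-9]+->//' /tmp/dns_errors.log | sort -u | wc -l)
echo "unique error signatures: $unique"
```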

This means:

  • DNS lookup fails for ≤ 1 second (a single unique timestamp per blip means all the errors fell within the same second)
  • DNS lookup does not necessarily fail for the same pod every time
    • But, it does fail for only one pod at a time
    • It has now failed for both the pods at different points of time - and the two pods do not reside on the same node. Therefore the problem is not node-related.
  • The IP address 172.20.0.10 belongs to the Kubernetes service for CoreDNS
    • There are 2 coredns pods, they reside in two different nodes
    • They both have 0 restarts
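A quick way to double-check the 172.20.0.10 claim (this assumes the standard kube-dns service name in kube-system, which is what EKS uses; the script is guarded so it degrades gracefully outside a cluster):

```shell
# Confirm the ClusterIP of the service fronting CoreDNS, and that its
# endpoints are populated. Falls back to a placeholder when kubectl is
# missing or the cluster is unreachable.
if command -v kubectl >/dev/null 2>&1; then
  svc_ip=$(kubectl -n kube-system get svc kube-dns \
    -o jsonpath='{.spec.clusterIP}' 2>/dev/null || echo "cluster-unreachable")
  kubectl -n kube-system get endpoints kube-dns 2>/dev/null || true
else
  svc_ip="kubectl-not-available"
fi
echo "kube-dns ClusterIP: $svc_ip"
```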

Also, here is the Corefile in use,

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

The Corefile does include the errors plugin, which means CoreDNS should be logging errors if the requests reached its pods.
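To confirm whether queries are reaching CoreDNS at all, the log plugin (not currently in this Corefile) would print every query received - noisy, but decisive. A sketch of the same Corefile with it added:

```
.:53 {
    errors
    log   # logs every query received; noisy, but proves whether requests arrive
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
```

With reload already enabled, editing the coredns ConfigMap should pick this up without restarting the pods.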

Where can the problem be:

  1. The request is not reaching coredns pods.

    • could be a problem at the Kubernetes service in front of it - the coredns service, not the banana service
  2. The request is reaching the coredns pods but they are not responding in time - resulting in a timeout

    Ways:

    • We need to check how much conntrack is getting used.
    • Change base image to ubuntu
    • Increase the number of CoreDNS pods? Not required - only 800 DNS requests arrive at CoreDNS.
      • No container restarts
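The conntrack check above can be sketched as follows (run on the node itself; the /proc paths are standard Linux but only exist when the nf_conntrack module is loaded, so this is hedged for that case):

```shell
# Check conntrack table pressure: a table near its max, or a non-zero
# insert_failed counter, are the classic signs of the UDP conntrack race
# that drops DNS packets.
count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max
if [ -r "$count_file" ] && [ -r "$max_file" ]; then
  status="conntrack entries: $(cat "$count_file") / $(cat "$max_file")"
else
  status="nf_conntrack not loaded on this host"
fi
echo "$status"
# If the conntrack CLI is installed, the per-CPU counters are more direct:
#   conntrack -S | grep insert_failed
```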

    To be continued, dumping all things here:

    • Add a dot at the end of the name (a fully-qualified name) to skip the resolver's search-list expansion?
    • Alpine issues
      • Use ubuntu or debian as base image.
    • A and AAAA queries are both in flight at the same time
      • Alibaba blog
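Two of the bullets above can be tried without changing the base image: Kubernetes exposes pod-level resolver tuning through dnsConfig. A hedged sketch, not a confirmed fix - the pod name is hypothetical, the option values are illustrative, and single-request-reopen only helps glibc (musl, i.e. Alpine, ignores it, which is part of why switching to ubuntu/debian comes up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pineapple-debug            # hypothetical pod for experimenting
spec:
  containers:
    - name: app
      image: ubuntu:22.04          # glibc base image
      command: ["sleep", "infinity"]
  dnsConfig:
    options:
      - name: ndots                # lower than the default 5: dotted names
        value: "2"                 # are tried as-is before search expansion
      - name: single-request-reopen  # serialize A/AAAA instead of sending
                                     # them in parallel over one socket
```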

    All the links: