I am Sanu Satyadarshi, part of the Platform Engineering division at Mercari, Inc. Platform Engineering provides a cost-effective, safe, and easy-to-use multi-cloud infrastructure service for all engineering teams to make and scale bets.
This article discusses the DNS-related challenges encountered at Mercari on our Kubernetes clusters and the significant improvements achieved by implementing Node-Local DNS Cache. By optimizing DNS traffic and reducing errors, we enhanced system reliability and scalability, preventing production outages caused by DNS failures.
Key Takeaways
- Reduced DNS calls to kube-dns by 10x, decreasing network overhead and inter-service communication costs.
- Lowered DNS query rates by 93% for services on the cluster.
- Achieved a 10x-100x reduction in DNS-level errors, improving system resilience.
- Eliminated the “failed to refresh DNS cache” errors, mitigating a frequent source of incidents.
DNS on Kubernetes: The Elephant in the Room
Domain Name System, more commonly known as DNS, is a critical component of internet infrastructure. It is the technology that allows your web browser to find the actual IP address of a website when you type example.com
into your browser. DNS in itself is a highly complex topic, and understanding it fully requires a book (or two) of its own.
Like any network infrastructure, Kubernetes depends on DNS to resolve service names like [service name].[namespace].svc.cluster.local
and other names to IPs, enabling communication among services and with the external world.
Given the role DNS plays in Kubernetes, you can imagine that any DNS failure or degradation can quickly escalate into increased latency, network congestion, and even complete outages.
On Kubernetes, DNS is installed as a kube-dns deployment running in the kube-system namespace. At Mercari, it comes pre-installed with our managed GKE clusters for service discovery and name resolution across the clusters.
kube-dns supports several configuration options through its ConfigMap, and client-side resolver parameters such as ndots can also be tuned per pod.
As kube-dns is responsible for resolving all service queries to IP addresses, scaling the kube-dns pods in proportion to the size of the cluster is the most logical first step.
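As an illustration, the ndots resolver option can be lowered per pod via the pod's dnsConfig; a smaller value reduces the number of search-path expansions (and thus DNS queries) issued for names with few dots. This is a minimal sketch, and the pod and image names are hypothetical:

```yaml
# Illustrative pod spec: lowering ndots from the Kubernetes default of 5.
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical name
spec:
  containers:
    - name: app
      image: example-app:latest  # hypothetical image
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```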
Fortunately, Kubernetes provides kube-dns autoscaling by default to deal with high-traffic clusters like ours.
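This default autoscaling is handled by the cluster-proportional-autoscaler, whose ConfigMap controls how the kube-dns replica count grows with cluster size. A minimal sketch (the parameter values below are illustrative, not our actual settings):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  # Linear mode: replicas = max(ceil(cores / coresPerReplica),
  #                             ceil(nodes / nodesPerReplica)),
  # clamped below by "min".
  linear: |-
    {"coresPerReplica": 256, "nodesPerReplica": 16, "min": 2}
```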
Our DNS Challenges
At Mercari, our Kubernetes clusters process extremely high RPS during peak hours, and this is where we started to see the limitations of kube-dns:
- High DNS query rates were overwhelming the kube-dns service.
- Frequent DNS-level errors, including NXDOMAIN and truncated responses.
- Recurring “failed to refresh DNS cache” errors were causing cache misses.
The final nail in the coffin was a Sev1 incident where multiple services started to fail DNS resolution, leading to timeouts and, eventually, a production outage due to the cascading nature of microservices.
Node-Local DNS Cache: Our Saviour
Previously, for any DNS query, all services relied on a handful of kube-dns pods to resolve domain names like [service name].[namespace].svc.cluster.local
to the IP address of the Service (a.k.a. Endpoints).
This setup overwhelmed the kube-dns pods and caused the issues described in the previous section.
NodeLocal DNSCache takes a radically different approach to handling DNS queries. Instead of relying on a few kube-dns
pods, it applies the tried-and-tested concept of caching at the Kubernetes node level. This allows all the pods on a particular node to use the DNS cache on that node before reaching out to the kube-dns pods.
Source: kubernetes.io
This provides multiple benefits:
- Localized DNS resolution, reducing inter-node traffic.
- High scalability of the cluster during peak business hours.
- Reduction of load on kube-dns, thus providing resiliency against kube-dns failures.
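Under the hood, NodeLocal DNSCache runs a CoreDNS-based DaemonSet that listens on a link-local address on each node (169.254.20.10 by convention) and forwards cache misses upstream to kube-dns. A simplified Corefile fragment illustrating the idea (the cache sizes and TTLs are illustrative, not our production values):

```
cluster.local:53 {
    cache {
        success 9984 30   # cache up to 9984 positive answers for 30s
        denial 9984 5     # cache negative answers (e.g., NXDOMAIN) for 5s
    }
    reload
    loop
    bind 169.254.20.10
    forward . __PILLAR__CLUSTER__DNS__ {   # kube-dns service IP, templated in
        force_tcp
    }
    prometheus :9253
}
```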
Implementation
Once we identified the solution, we started planning the rollout strategy for node-local-dns-cache across all our environments.
To do a gradual rollout and reduce the blast radius, we first deployed NodeLocal DNSCache on our Laboratory GKE cluster (which is used only by Platform teams for internal testing) with a specific nodeAffinity.
This allowed us to safely measure the impact of NodeLocal DNSCache without impacting all the workloads.
Based on our learnings, we then gradually rolled out NodeLocal DNSCache across all our Dev and Prod environments, adding labels to the node pools to allow NodeLocal DNSCache pods to be scheduled on them.
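The label-driven rollout can be sketched as a nodeAffinity rule on the node-local-dns DaemonSet keyed to a node-pool label; the label key and value below are hypothetical, chosen only to illustrate the mechanism:

```yaml
# Hypothetical label used to opt node pools into NodeLocal DNSCache.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-local-dns-enabled   # hypothetical label key
              operator: In
              values: ["true"]
```

With this in place, enabling the cache on a new node pool is just a matter of labeling it, which keeps the blast radius of each rollout step small.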
Impact and Results
The results exceeded our expectations:
- 10x reduction in DNS calls to kube-dns.
- A 10x to 100x reduction in DNS-level errors, depending on the class of error (e.g., 10x for NXDOMAIN, 100x for truncated responses).
- 100% elimination of “failed to refresh DNS cache” errors, which were responsible for many production incidents.
- Significant improvement in cluster scalability and network efficiency.
DNS Error count before and after the rollout
DNS Query rate before and after the rollout
Conclusion
Implementing Node-Local DNS Cache addressed our DNS challenges, resulting in a 10x reduction in DNS traffic, fewer errors, and enhanced system reliability. These improvements underscore the importance of optimizing DNS in Kubernetes clusters, especially for high-traffic environments like ours. By sharing our experience, we hope to guide others in enhancing their DNS operations and achieving similar results.
I would like to thank Yusaku Hatanaka (hatappi) and Tarun Duhan for their valuable inputs and contributions during the implementation.