The Bottleneck of DNS

Vishnu Mohan

Most software products and organisations pursue reliability by keeping their services and resources running 24×7. DNS, however, introduces a distinct bottleneck: many companies outsource it to providers such as Akamai or Cloudflare, which leaves them with little or no direct control over it. If the DNS provider has an outage, the service becomes unreachable even though the application itself is fully operational.

  • DNS is an external dependency for most companies because they delegate it to managed providers (Akamai, Cloudflare, AWS Route 53, Google Cloud DNS, etc.).

  • If the provider’s infrastructure fails or records are misconfigured, your domain may become unreachable even if your backend is 100% healthy.

  • Without dedicated DNS monitoring, such an outage can also go unnoticed by IT until it impacts the business.
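
As a rough illustration of what a dedicated DNS check could look like, here is a minimal sketch using only the Python standard library. The names `probe_dns`, `system_resolve` and `ProbeResult` are made up for this example. It resolves through the operating system's resolver, so it may see cached answers; a real monitor would probably also query the authoritative name servers directly.

```python
import socket
import time
from typing import Callable, List, NamedTuple, Optional


class ProbeResult(NamedTuple):
    hostname: str
    ok: bool
    addresses: List[str]
    elapsed_ms: float
    error: Optional[str]


def system_resolve(hostname: str) -> List[str]:
    """Resolve a name through the operating system's configured resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def probe_dns(
    hostname: str,
    resolve: Callable[[str], List[str]] = system_resolve,
) -> ProbeResult:
    """Try to resolve hostname once and report success, addresses and latency."""
    start = time.monotonic()
    try:
        addresses = resolve(hostname)
        error = None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        addresses = []
        error = str(exc)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return ProbeResult(hostname, bool(addresses), addresses, elapsed_ms, error)


if __name__ == "__main__":
    result = probe_dns("example.com")
    status = "OK" if result.ok else "FAILED: %s" % result.error
    print(result.hostname, status, "%.1f ms" % result.elapsed_ms)
```

Run on a schedule from a few different networks, a check like this can raise an alert when resolution fails while the application's own health checks are still green.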

Why it’s a bottleneck

  • Single point of failure – If your DNS is hosted entirely by one provider, their outage blocks users from even resolving your domain name.

  • Control limitations – You can’t directly fix the provider’s outage. You’re dependent on their recovery process.

  • Propagation delays – Even if you switch providers during an outage, the change takes time to reach users, because resolvers across the internet keep serving cached records until their TTL expires.
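
The TTL delay in the last point can be put in rough numbers. The sketch below assumes a resolver that honours the TTL exactly and is asked for the name continuously, so it is a simplified model rather than a measurement. The function names are made up for this example.

```python
SECONDS_PER_DAY = 86_400


def worst_case_stale_seconds(ttl_seconds: int) -> int:
    """A resolver that cached the old record just before the change
    keeps serving it for up to one full TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return ttl_seconds


def max_refreshes_per_day(ttl_seconds: int) -> float:
    """Upper bound on how often one caching resolver re-queries the
    authoritative servers for a single record in a day."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return SECONDS_PER_DAY / ttl_seconds


for ttl in (60, 300, 3600):
    print(
        "TTL %5d s: stale for up to %5d s, at most %4.0f refreshes/day per resolver"
        % (ttl, worst_case_stale_seconds(ttl), max_refreshes_per_day(ttl))
    )
# TTL    60 s: stale for up to    60 s, at most 1440 refreshes/day per resolver
# TTL   300 s: stale for up to   300 s, at most  288 refreshes/day per resolver
# TTL  3600 s: stale for up to  3600 s, at most   24 refreshes/day per resolver
```

The same numbers show the trade-off: a lower TTL shortens the window in which users see old records, at the cost of more queries reaching the authoritative servers.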

That said, there are nuances, and companies have several ways to mitigate this bottleneck.

How companies reduce the risk

  1. Multi-DNS provider setup

    • Some organisations host their authoritative DNS zones with two independent providers.

    • Example: Cloudflare + NS1, or Route 53 + Akamai.

    • If one provider goes down, resolvers can still query the other.

    • This requires keeping the same records in sync across both providers, which is often automated.

  2. Short TTLs (Time To Live)

    • Keep DNS record TTLs low (for example 60–300 seconds) so that switching providers or endpoints takes effect faster.

    • This helps in disaster recovery but increases query load on DNS servers.

  3. Anycast + Global Load Balancing

    • Big DNS providers themselves use Anycast networks so that even if one region is down, queries go to the next closest healthy node.

  4. Failover within the provider

    • Many DNS services offer health-check based failover so if one endpoint is down, DNS returns an alternative IP.

  5. Local DNS Caching & Hardcoding for Critical Internal Services

    • For internal systems, companies sometimes hardcode IPs in /etc/hosts or internal resolvers to bypass DNS temporarily.
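
For the multi-provider setup in point 1, the automated sync usually needs some way to notice when the two zones have drifted apart. Here is a minimal sketch of that comparison. It assumes each provider's records have already been exported into a plain dictionary, which in practice would come from the provider's API or a zone file export. The `zone_drift` name and the sample records are made up for this example.

```python
from typing import Dict, FrozenSet, Tuple

RecordKey = Tuple[str, str]            # (record name, record type)
Zone = Dict[RecordKey, FrozenSet[str]]  # key -> set of record values


def zone_drift(
    primary: Zone, secondary: Zone
) -> Dict[RecordKey, Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Return every record set that differs between the two zones,
    mapped to (values at primary, values at secondary)."""
    drift = {}
    for key in sorted(primary.keys() | secondary.keys()):
        a = primary.get(key, frozenset())
        b = secondary.get(key, frozenset())
        if a != b:
            drift[key] = (a, b)
    return drift


# Hypothetical exports from two providers (documentation-only addresses).
provider_a: Zone = {
    ("www.example.com", "A"): frozenset({"192.0.2.10"}),
    ("example.com", "MX"): frozenset({"10 mail.example.com."}),
}
provider_b: Zone = {
    ("www.example.com", "A"): frozenset({"192.0.2.10", "192.0.2.11"}),
    ("example.com", "MX"): frozenset({"10 mail.example.com."}),
}

for (name, rtype), (a, b) in zone_drift(provider_a, provider_b).items():
    print(name, rtype, sorted(a), sorted(b))
```

An empty result means both providers would answer the same way. Anything else is worth an alert, because resolvers may get different answers depending on which provider they happen to query.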

This short article was inspired by a YouTube video by Hussein Nasser, which got me thinking about this problem and how companies mitigate it.

The real incident behind it happened in July 2021, when Akamai’s DNS service had a major outage that disrupted access to numerous websites and applications. Such outages are rare, but many companies were still impacted, and because they had no secondary DNS, users couldn’t reach them even though their applications were fine.
