The Cloudflare Outage Wasn't a Hack — It Was a Config Change. Here's What Developers Should Do.
When Cloudflare's DNS resolver went down, millions of sites went with it. Not because of an attack. Because of a misconfiguration. Here's what that means for how you build resilience.
Yumi Hirasako
Security Researcher at Data Hogo
One day in 2025, millions of websites went offline for roughly an hour. No ransomware. No DDoS. No nation-state threat actor. A configuration change in Cloudflare's infrastructure propagated globally and took down their DNS resolver (1.1.1.1). Your site's uptime depended on a change you had zero visibility into.
That's the Cloudflare DNS outage story, and the real lesson isn't about Cloudflare. It's about how developers build systems that depend on a single vendor for everything.
What actually happened
Cloudflare pushed a routing configuration change — related to BGP (Border Gateway Protocol), the system that governs how traffic is directed between networks — that didn't behave as expected.
BGP misconfigurations are notoriously hard to catch before they go live. A rule propagates globally in minutes. By the time you realize something is wrong, every anycast DNS node is already affected. Recovery isn't flipping a switch — it's rolling back a change that's already spread across hundreds of points of presence worldwide.
The result: 1.1.1.1, Cloudflare's DNS resolver, became unreachable for a significant portion of users. Requests that should have resolved a domain to an IP address just... didn't. Sites that relied entirely on Cloudflare's infrastructure — DNS, CDN, WAF, all of it — went dark simultaneously.
Recovery took roughly 60-90 minutes. For most sites, that's between $1,000 and $50,000 in lost revenue, depending on scale. For a solo developer whose side project just happened to be featured on Hacker News that day, it's a different kind of painful.
Why this hits differently than a hack
When a hacker takes you down, you have a clear adversary. You know what happened. There's a response playbook: isolate, assess, patch, communicate.
When your vendor's own engineer causes the outage, none of your defenses matter. Your WAF didn't fail. Your security posture didn't fail. Your DDoS protection was fine. A completely internal change at a third-party company took you offline, and you had no warning, no control, and no recourse except waiting.
This is the operational risk category that developers rarely model. We spend a lot of time thinking about what an attacker might do to our systems. We spend almost no time thinking about what our vendors might accidentally do to our systems.
The threat model is different, but the impact is identical.
The single-vendor trap
Cloudflare's appeal is obvious. You point your nameservers there, enable the proxy, turn on the WAF, add DDoS protection, configure caching rules — and suddenly your $7/month VPS is hiding behind enterprise infrastructure. The integration is seamless, the docs are excellent, and the free tier is genuinely generous.
The trap is that you've now routed your domain, DNS resolution, CDN layer, DDoS protection, and firewall through a single vendor. One blast radius. When Cloudflare has an outage, everything goes down simultaneously — not as separate failures you might be able to route around, but as a single correlated event.
Many developers don't even realize how deep this dependency goes until something breaks.
DNS resilience: what your options actually are
You don't have to abandon Cloudflare. You do have to stop treating any single vendor as a hard dependency.
Use a secondary DNS provider
Most domain registrars let you specify multiple authoritative nameservers. When your primary DNS is Cloudflare, adding a secondary from a different provider — AWS Route 53, NS1, or Hetzner — means that if Cloudflare's nameservers are unreachable, resolvers can fall back to the secondary.
Here's what that looks like in a typical Terraform setup with Route 53 as the secondary:
```hcl
# Cloudflare as primary, Route 53 as secondary
# Both serve identical zone records
resource "aws_route53_zone" "secondary" {
  name    = "yourdomain.com"
  comment = "Secondary DNS - fallback if Cloudflare is unreachable"
}

resource "aws_route53_record" "apex" {
  zone_id = aws_route53_zone.secondary.zone_id
  name    = "yourdomain.com"
  type    = "A"
  ttl     = 3600
  records = ["your.origin.ip.address"] # GOOD: direct to origin, bypasses CF proxy
}
```

You then add both the Cloudflare nameservers and the Route 53 nameservers at your domain registrar. Both serve the same records. When one is unavailable, DNS resolvers use the other.
The catch: the Route 53 records point directly to your origin, bypassing the Cloudflare proxy. That's the point — it's a fallback, not a parallel CDN. You lose caching and WAF protection during a Cloudflare outage, but your site stays up.
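Once both providers are configured, it's worth verifying that they actually answer with identical records, since drift between the two zones is the classic failure mode of secondary DNS. A minimal sketch with `dig`; the nameserver hostnames in the usage line are placeholders, so substitute the ones your registrar actually lists:

```shell
#!/bin/sh
# Sketch: confirm two authoritative providers serve identical A records.

records_from() {
  # sorted A records for domain $2 as served by nameserver $1
  dig @"$1" "$2" A +short | sort
}

compare_providers() {
  a=$(records_from "$1" "$3")
  b=$(records_from "$2" "$3")
  if [ "$a" = "$b" ]; then echo "records match"; else echo "MISMATCH"; fi
}

# Usage (with your real nameserver hosts and domain):
# compare_providers ana.ns.cloudflare.com ns-123.awsdns-01.com yourdomain.com
```

Running this in CI or a daily cron job turns "both zones are identical" from a hope into a checked invariant.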
TTL strategy: low before changes, high in steady state
TTL (Time To Live) is how long DNS resolvers cache a record before asking for a fresh one. Most developers set it once and forget it.
The right approach:
- Steady state: Use a TTL of 3600 seconds (1 hour) or higher. This means most of your users are being served from cache, so a brief outage at your DNS provider doesn't immediately affect them. Records stay cached.
- Before any infrastructure change: Drop TTL to 60 seconds at least one full TTL period in advance (for a 3600-second TTL, lower it an hour or more before the change) so cached high-TTL records have expired by the time you make the change. This ensures you can roll back fast if something goes wrong. Once the change is confirmed stable, raise it back.
```shell
# Check your current TTL
dig yourdomain.com +noall +answer

# Example output when the TTL is too low (300s) for steady state:
# yourdomain.com.  300  IN  A  104.21.x.x
```

A 300-second TTL in steady state means a Cloudflare outage reaches your users within five minutes of their cached records expiring. A 3600-second TTL means up to 60 minutes of cached resolution before anyone even asks Cloudflare for a fresh answer, which is often longer than the outage itself.
Health checks and automatic failover
AWS Route 53 health checks can automatically switch DNS traffic if your primary endpoint fails. Cloudflare has a similar feature (Load Balancing) on paid plans.
```shell
# AWS CLI: create a health check targeting your origin
aws route53 create-health-check \
  --caller-reference "$(date +%s)" \
  --health-check-config '{
    "IPAddress": "your.origin.ip",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "FullyQualifiedDomainName": "yourdomain.com",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'
```

Your /health endpoint should return a fast 200 OK with minimal logic, just enough to confirm the app is running. Don't put database queries in it. A health check endpoint that times out under load defeats the purpose.
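A quick way to sanity-check that constraint is to time the endpoint yourself with curl's write-out metrics. A small sketch; the domain is a placeholder:

```shell
#!/bin/sh
# Sketch: measure /health status code and latency via curl.

health_time() {
  # prints "<http_status> <total_seconds>" for $1/health
  curl -sS -o /dev/null -m 5 -w '%{http_code} %{time_total}' "$1/health"
}

# health_time https://yourdomain.com
# Anything consistently above a few hundred milliseconds suggests the
# check is doing more work than it should.
```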
What indie developers can realistically do
Enterprise multi-CDN strategies cost real money. A three-CDN setup with global load balancing and automatic failover is not a weekend project. But if you're a solo developer or a small team, there are things you can do today for free or close to it.
Know your single points of failure. Write them down.
This sounds obvious, but most developers don't actually do it. Open a doc and list every vendor your site depends on to stay online. Cloudflare, your registrar, your hosting provider, your database host, your auth provider. For each one, ask: if this goes down for an hour, what happens?
Make sure your origin is directly accessible.
If you're on Vercel, Railway, Fly.io, or any comparable platform, your app has a direct URL that bypasses Cloudflare. Find it and confirm it works. Save that URL somewhere you can access without Cloudflare. If Cloudflare's DNS is down but your Vercel deployment URL still works, you can at least point people there manually while you wait for recovery.
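You can rehearse this without touching DNS at all: curl's `--resolve` flag pins a hostname to a known IP for one request, which is exactly the situation you'd be in during a resolver outage. A sketch using a documentation-range placeholder IP:

```shell
#!/bin/sh
# Sketch: reach the origin even when DNS resolution is failing.
# --resolve maps host:port to an IP, bypassing DNS for this request.
# 203.0.113.10 is a placeholder; use your real origin IP.

origin_check() {
  # origin_check <domain> <origin_ip>: fetch headers straight from origin
  curl -sSI -m 10 --resolve "$1:443:$2" "https://$1/"
}

# origin_check yourdomain.com 203.0.113.10
```

If this works today, you know your origin is reachable independently of Cloudflare; if it doesn't, fix that before the next outage rather than during it.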
Set up independent uptime monitoring.
UptimeRobot and Better Uptime both have free tiers that ping your site every 60 seconds and alert you when it goes down. The key word is independent — your uptime monitor should not be running on Cloudflare infrastructure. Point it at your origin IP, not your Cloudflare-proxied domain.
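If you'd rather not depend on a third-party monitor either, even a cron job on a cheap VPS outside Cloudflare's network covers the basics. A minimal sketch; the alerting hook and origin IP are placeholders for whatever you actually use:

```shell
#!/bin/sh
# Sketch: minimal independent uptime probe, run from cron every minute:
#   * * * * * /usr/local/bin/origin-probe.sh
# Probes the origin IP directly so it keeps working during a Cloudflare
# outage; wire the failure branch to mail/Slack/your alerting of choice.

probe() {
  if curl -fsS -m 10 -o /dev/null "http://$1/health"; then
    return 0
  fi
  echo "origin $1 failed health check at $(date -u)" >&2
  return 1
}

# probe 203.0.113.10
```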
Store your credentials somewhere accessible.
Here's the painful irony: if your password manager uses Cloudflare DNS, you can't log into it during a Cloudflare outage. And if you can't log in, you can't access your Cloudflare credentials to fix anything. Keep a local backup of critical credentials in an encrypted file on your machine. Not in the cloud. On your machine.
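One low-tech way to keep that backup is a passphrase-encrypted file via openssl, which is already installed on most machines. A sketch under the assumption that file names are illustrative and you'll pick your own passphrase:

```shell
#!/bin/sh
# Sketch: keep break-glass credentials in a locally encrypted file.
# openssl's -pbkdf2 mode derives the key from a passphrase; nothing
# touches the network or a keyring.

encrypt_creds() {
  # encrypt_creds <plaintext> <ciphertext> <passphrase>
  openssl enc -aes-256-cbc -pbkdf2 -salt -pass "pass:$3" -in "$1" -out "$2"
}

decrypt_creds() {
  # decrypt_creds <ciphertext> <passphrase>: prints plaintext to stdout
  openssl enc -d -aes-256-cbc -pbkdf2 -pass "pass:$2" -in "$1"
}

# encrypt_creds critical-creds.txt critical-creds.txt.enc 'long passphrase'
# shred -u critical-creds.txt    # then remove the plaintext
# decrypt_creds critical-creds.txt.enc 'long passphrase'
```

Passing the passphrase as an argument is fine for a sketch; when running interactively, omit `-pass` and let openssl prompt so the passphrase never appears in your shell history or process list.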
The real lesson: misconfiguration is the attack surface
Look at the pattern across recent major outages. GitLab accidentally deleted a production database. AWS us-east-1 went down after an S3 configuration change. CrowdStrike caused a global Windows outage with a bad content update. Cloudflare took down DNS with a BGP routing change.
None of these were hacks. All of them were internal configuration changes that went wrong.
The pattern is consistent enough to be worth naming: misconfiguration is now a more common cause of outages than external attacks. OWASP lists Security Misconfiguration as the fifth most critical web application risk — and that's just misconfiguration in application code, not infrastructure.
The implication for how you work: a change to your nginx.conf, your GitHub Actions workflow, your Dockerfile, or your Cloudflare routing rules deserves the same review process as a change to your application code. If you wouldn't merge a code change without a PR and a review, the same discipline should apply to infrastructure.
Config is code. Treat it that way.
If you want to check whether your own infrastructure configuration files — Dockerfiles, CI/CD workflows, server configs — contain misconfigurations that could cause exactly this kind of operational failure, that's what Data Hogo scans for alongside code vulnerabilities. A misconfigured Dockerfile or a GitHub Actions workflow with hardcoded credentials is a security issue before it's an operational one.
TL;DR
- The Cloudflare DNS outage in 2025 was caused by a BGP routing misconfiguration, not a cyberattack. Recovery took 60-90 minutes.
- When you route DNS, CDN, WAF, and DDoS protection through one vendor, you have a single blast radius. One internal change can take all of it down at once.
- Secondary DNS from a different provider is the most effective mitigation. Your registrar can list nameservers from both Cloudflare and Route 53 — resolvers use whichever responds.
- TTL strategy matters: high TTL (3600s+) in steady state means cached records survive short outages. Low TTL (60s) before planned changes means you can roll back fast.
- Know your origin URL — your Vercel/Railway/Fly.io direct URL bypasses Cloudflare and stays accessible even when Cloudflare DNS is down.
- Set up independent uptime monitoring (UptimeRobot free tier, pinging every 60 seconds) on infrastructure that doesn't depend on Cloudflare.
- The broader lesson: misconfiguration has caused more major outages in recent years than external attacks. Treat infrastructure changes with the same review discipline as code changes.
FAQ
Was the Cloudflare outage a cyberattack?
No. The Cloudflare DNS outage was caused by an internal misconfiguration — a routing or configuration change that propagated incorrectly. No malicious actor was involved. Recovery required Cloudflare engineers to roll back the change.
How can I protect my site from a Cloudflare outage?
You can't prevent Cloudflare from having outages, but you can reduce your dependency on a single point of failure: use a secondary DNS provider that's separate from your CDN vendor, monitor your uptime independently, and make sure your origin server is directly accessible if needed.
Is using multiple DNS providers complicated to set up?
For basic redundancy, not really. You configure your domain registrar to list name servers from two different providers. Both providers serve the same DNS records. When one is down, resolvers fall back to the other.