When Fly.io Goes Down: A Survival Guide for Your Team

Your app runs on Fly.io in three regions. A user in Berlin reports the site is down. You check from your office in San Francisco — it loads in 200ms. You tell the user to try again. They do. It's still down. The Frankfurt region had a machine failure 40 minutes ago, and Fly's proxy hasn't rerouted the traffic yet.

What Happens on Your Team

The Full-Stack Developer

Deploys a new version with `fly deploy`. The deploy succeeds in 2 of 3 regions. The third region's machine fails to start due to a missing environment variable that was set in the other regions but not propagated. One-third of users see 502 errors.

The real cost: Multi-region deployments can partially fail. Fly.io's deploy output might show success even when individual machines in specific regions fail to start correctly. Without per-region monitoring, partial deployment failures go unnoticed.

What they should have had: An HTTP monitor on the app's public URL. Even though the monitor checks from one location, it will catch total failures and significant degradation. For regional issues, correlating user reports with monitoring data helps isolate which region is affected.

The DevOps Engineer

With auto-stop enabled, Fly.io stops machines after periods of inactivity to save costs. A user hits the app after an idle period and waits 8 seconds for the machine to start. They refresh, wait again, and leave. The DevOps engineer never sees the slow start because they access the app frequently enough to keep machines running.

The real cost: Machine auto-stop is a cost optimization that trades startup latency for savings. If your app has irregular traffic patterns, users who arrive during idle periods get a degraded experience that your team never sees.

What they should have had: Response time monitoring that tracks cold start delays. When a check hits an auto-stopped machine, the response time spikes to 5-10 seconds. That data shows how often cold starts happen and how bad they are — information you need to decide whether to keep machines always running.
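One way to put numbers on cold starts is to time each check and count how many successful responses cross a threshold. The sketch below is illustrative only: `timed_get`, `cold_start_rate`, and the 5-second threshold are hypothetical names and values chosen to match the 5-10 second range described above, not part of any Fly.io or Monitoristic API.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass

# Assumption: anything at or above 5 s is treated as a cold start.
COLD_START_SECONDS = 5.0


@dataclass
class Check:
    status: int      # HTTP status code, or 0 if the request failed outright
    seconds: float   # wall-clock time for the full response


def timed_get(url: str, timeout: float = 30.0) -> Check:
    """Fetch url once and record the status and elapsed time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code          # server answered with a 4xx/5xx
    except OSError:
        status = 0                 # DNS failure, refused connection, timeout
    return Check(status=status, seconds=time.monotonic() - start)


def cold_start_rate(checks: list[Check],
                    threshold: float = COLD_START_SECONDS) -> float:
    """Fraction of successful checks that were slow enough to look like cold starts."""
    ok = [c for c in checks if c.status == 200]
    if not ok:
        return 0.0
    slow = sum(1 for c in ok if c.seconds >= threshold)
    return slow / len(ok)
```

Calling `timed_get` on a schedule against your public URL and passing the collected results to `cold_start_rate` gives a rough share of requests that landed on a stopped machine, which is the figure you need when weighing always-on machines against the savings.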

The SRE / On-Call Engineer

Gets paged for a production issue. Checks Fly.io status — all green. Checks the app — it's up. Checks logs — no errors. Turns out, the issue was a 15-minute outage in the Singapore region that affected Asian users during their business hours. By the time the SRE checked, the region had recovered.

The real cost: Regional outages are often transient and hard to catch after the fact. If nobody was monitoring during the failure window, the only evidence is a gap in traffic from that region, which you'd notice only if you were looking at geographic analytics.

What they should have had: Continuous monitoring with incident history. Even if the outage resolves on its own, the monitoring tool records it — when it started, how long it lasted, and what HTTP status was returned. That data is critical for post-mortems and for deciding whether to add redundancy.
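To make the idea of incident history concrete, here is a minimal sketch of how a series of check results could be folded into incidents with a start time, a duration, and the statuses seen. The `Incident` type, the `incidents` function, and the `(timestamp, status)` sample format are assumptions made for illustration; a real monitoring tool will store this differently.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    started: int         # timestamp of the first failing check
    duration: int        # seconds until the first passing check (or the last sample)
    statuses: list[int]  # distinct non-200 statuses seen, in order of appearance


def incidents(samples: list[tuple[int, int]]) -> list[Incident]:
    """Group consecutive failing checks into incidents.

    samples is a list of (unix_timestamp, http_status) pairs sorted by time.
    A status of 0 stands for "no response at all" (hypothetical convention).
    """
    found: list[Incident] = []
    start = None
    seen: list[int] = []
    for ts, status in samples:
        if status != 200:
            if start is None:
                start, seen = ts, []      # a new incident begins here
            if status not in seen:
                seen.append(status)
        elif start is not None:
            found.append(Incident(start, ts - start, seen))
            start = None                  # recovered
    if start is not None:                 # still failing at the end of the data
        found.append(Incident(start, samples[-1][0] - start, seen))
    return found
```

With checks every 60 seconds, a 15-minute regional outage would show up as a single `Incident` with a duration near 900, even if everything is green again by the time someone looks.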

Why Monitor Fly.io?

Fly.io runs your app across multiple regions, which is great for performance — but it also means failures can be regional. Your app might be down in Frankfurt but running fine in Chicago. Without multi-region-aware monitoring, you'd never know half your European users can't reach your service.

What to Monitor

your-app.fly.dev: Your app's default Fly.io domain
your-custom-domain.com: Custom domain pointing to Fly.io
your-app.fly.dev/health: Health check endpoint

What You Should Actually Do

  1. Monitor your app's public URL (either fly.dev or your custom domain) from outside Fly.io's network
  2. Track response times to catch cold starts from machine auto-stop: spikes to 5+ seconds indicate machines were sleeping
  3. Check your health endpoint, not just the root URL: the proxy can return 200 while your app returns errors
  4. Monitor after every deploy: multi-region deploys can partially fail without clear error messages
  5. Review incident history weekly to catch transient regional issues that resolved before anyone noticed
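The post-deploy step in the list above can be sketched as a short script that hits the health endpoint several times and reports any non-2xx answers. Everything named here (`http_status`, `post_deploy_check`, the `/health` path, the attempt count) is a hypothetical choice for illustration. Note also that repeated requests from one location mostly exercise the nearest region, so this is a smoke test rather than proof that every region is healthy.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return 0


def post_deploy_check(base_url: str,
                      attempts: int = 10,
                      health_path: str = "/health",
                      pause: float = 0.0,
                      fetch: Callable[[str], int] = http_status) -> dict:
    """Probe the health endpoint repeatedly and summarize the results."""
    statuses = []
    for i in range(attempts):
        statuses.append(fetch(base_url + health_path))
        if pause and i < attempts - 1:
            time.sleep(pause)             # spread probes out a little
    failures = sum(1 for s in statuses if not 200 <= s < 300)
    return {"statuses": statuses, "failures": failures, "healthy": failures == 0}
```

A call such as `post_deploy_check("https://your-app.fly.dev", pause=2.0)` right after `fly deploy` returns a summary you can fail a CI step on. The `fetch` parameter exists so the logic can be exercised without network access.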

Fly.io's Official Status Page

Fly.io publishes real-time status at status.flyio.net. Monitoristic doesn't replace this; it complements it. The official page tells you when Fly.io reports an issue. Your own monitor tells you when your app is affected, often before the status page updates. You also get push alerts instead of checking a webpage manually.

Fly.io's multi-region architecture is powerful, but it introduces a failure mode most developers aren't used to: regional partial failures. Your app can be up in Chicago and down in Frankfurt simultaneously. External monitoring won't catch every regional issue from a single location, but it catches total failures, cold start delays, and deployment problems — the most common sources of user-facing downtime.

Frequently Asked Questions

Does Fly.io have built-in health checks?
Yes — Fly.io supports internal health checks that determine whether a machine should receive traffic. But these are internal to Fly's proxy layer. They don't alert you when something fails, and they don't provide incident history. External monitoring adds alerts and tracking on top of Fly's internal checks.
How do I monitor Fly.io cold starts?
Set up an HTTP monitor with response time tracking. When Fly.io auto-stops a machine and a request triggers a restart, the response time will spike from your normal baseline (100-300ms) to several seconds. Monitoring this over time shows how often your users experience cold starts.
Should I monitor my fly.dev domain or my custom domain?
Monitor your custom domain. This tests the full chain — DNS, SSL, Fly.io's proxy, and your application. If you only monitor the fly.dev domain, you'll miss DNS and SSL issues on your custom domain.
How is this different from status.flyio.net?
Fly.io's status page reports platform-wide incidents by region. Your monitor checks YOUR specific app. Deployment failures, cold starts, machine crashes, and app-level errors are specific to your deployment and don't appear on the platform status page.