How do I know if my website is down for everyone or just me?

Check from multiple sources. Use your uptime monitoring tool's dashboard, try accessing the site from your phone (on mobile data, not WiFi), and ask a colleague in a different location to check. If your monitor shows downtime, it's not just you.

Should I tell my users when my site is down?

Yes, always. Silence during an outage is worse than the outage itself. Post a brief update on your status page, social media, or support channels. Users forgive downtime. They don't forgive being left in the dark.

How long should I wait before escalating an outage?

Not long. If your first check confirms the site is down and a quick restart or hosting check doesn't fix it within 5 minutes, escalate immediately. Contact your hosting provider, check third-party service status pages, and loop in anyone who can help. Escalating early costs little; escalating late prolongs the outage.

What should I do after my site comes back up?

Verify full functionality, not just the homepage. Check login, checkout, API endpoints, and any third-party integrations. Update your status page to "Resolved." Then, within 24 hours, do a brief post-mortem: what happened, when it started and ended, how it was detected, how it was fixed, and what you'll do to prevent it next time.

What to Do When Your Website Goes Down: A Step-by-Step Checklist

Your monitoring alert fires. Your site is down.

The next 10 minutes determine whether this is a minor blip or a reputation-damaging incident. That's rarely down to the technical fix, since many outages resolve themselves or need only a simple restart. The damage comes from how you respond: how fast you confirm, how clearly you communicate, and whether you learn anything from it afterward.

Here's the checklist. Bookmark it. You'll need it at 2 AM when your brain isn't at its best.

Step 1: Confirm the Outage (1 minute)

Before you do anything else, confirm the site is actually down: not just slow, not just broken for you, not just a DNS cache issue on your machine.

Check these in order:

Open your monitoring dashboard. If your uptime monitor shows the site as down with a timestamp, that's confirmation.
Try accessing the site from your phone on mobile data (not WiFi) — this rules out local network issues.
Check a different page or endpoint — the homepage might be down while the API is fine, or vice versa.

What you're looking for: Is it completely unreachable (connection timeout, DNS failure) or partially broken (500 errors, slow loading, specific pages failing)? This distinction determines your next step.

Don't skip this. People waste 15 minutes troubleshooting their own network when the site is actually fine, or miss that only one endpoint is failing while the rest works.
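
To make the "unreachable vs. partially broken" distinction concrete, here is a minimal Python sketch that sorts a failed request into one of those buckets. It uses only the standard library. The function names (classify_failure, check_url) and the category labels are made up for this example, and it complements your monitoring dashboard rather than replacing it.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc: Exception) -> str:
    """Map an exception from a failed request to a rough outage type."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    # The server answered, so the site is reachable but broken.
    if isinstance(exc, urllib.error.HTTPError):
        return f"partial: HTTP {exc.code}"
    # URLError wraps the underlying cause in .reason; read timeouts can
    # also surface as a bare timeout exception.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, socket.gaierror):
        return "unreachable: DNS failure"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "unreachable: timeout"
    if isinstance(reason, ConnectionRefusedError):
        return "unreachable: connection refused"
    return "unreachable: other"


def check_url(url: str, timeout: float = 5.0) -> str:
    """Request one URL and describe the result in a single line."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"up: HTTP {resp.status}"
    except Exception as exc:
        return classify_failure(exc)
```

Calling check_url on both the homepage and an API endpoint covers the "check a different page or endpoint" item: if one reports "up" and the other "partial", you know the outage is not site-wide.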

Step 2: Check the Obvious Causes (2–3 minutes)

Most outages have simple causes. Check these before diving into logs:

Hosting provider:

Is your server/droplet/instance actually running? Check your provider's dashboard.
Did you run out of resources (disk, memory, CPU)? Provider dashboards usually show this.
Check your hosting provider's status page — if it's a platform-wide issue, there's nothing you can fix on your end.

Recent changes:

Did anyone deploy code in the last hour? A bad deploy is one of the most common causes of sudden outages.
Did any scheduled tasks run? Database migrations, cron jobs, and backup processes can take down a site.
Did you update any environment variables, DNS records, or SSL certificates recently?

Third-party services:

Is your database accessible? Check Supabase, Firebase, or your managed database provider's status.
Is your CDN/edge network healthy? Check Cloudflare, Netlify, or Vercel.
Is your auth provider responding? A broken auth service blocks all logged-in functionality.
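
If a provider's status page says everything is fine but you still suspect a dependency, a quick reachability probe can help narrow things down. The sketch below is one way to do that in Python, assuming you know the host and port of the service (both are placeholders you supply; tcp_reachable is a name invented here). A successful TCP connection only shows that something is listening, not that the database or auth service is healthy.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```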

If you find the cause here, you know which fix to apply in Step 4, but send the Step 3 updates first. If not, continue.

Step 3: Communicate Immediately (Don't Wait)

This is where most teams fail. They stay silent, hoping to fix it quickly and pretend it didn't happen. That never works.

Right now, before you fix anything:

Update your status page: If you have a public status page, mark the affected service as "Investigating." This takes 30 seconds and saves you from answering 50 individual support messages.
Post in your team channel: "Site is down, investigating. Will update in 10 minutes." This prevents five people from independently discovering the same problem.
Prepare a user-facing message: Keep it simple. "We're aware of an issue affecting service. We're investigating and will update shortly." No jargon, no blame, no speculation about the cause.
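
If your team channel accepts incoming webhooks, the team update can be scripted ahead of time so nobody has to compose it at 2 AM. This is a minimal sketch, not a finished integration: the function names are invented for the example, the webhook URL is something you would supply, and the {"text": ...} payload shape is the one Slack-style incoming webhooks accept. Other chat tools may expect a different body, so check yours.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_update(status, detail, now=None, next_update_minutes=10):
    """Build a short, jargon-free status message with a promised next update."""
    # Pass a timezone-aware UTC datetime; defaults to the current time.
    now = now or datetime.now(timezone.utc)
    next_at = now + timedelta(minutes=next_update_minutes)
    return {"text": f"[{status}] {detail} Next update by {next_at:%H:%M} UTC."}


def post_update(webhook_url, payload, timeout=5.0):
    """POST the payload as JSON to the webhook and return the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Stating when the next update is due matters more than the wording: it tells people when to look again instead of asking.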

What NOT to communicate:

Don't say "We'll be back in 5 minutes" unless you're certain
Don't blame a vendor by name in real time
Don't share technical details publicly (save that for the post-mortem)

Step 4: Fix It

The fix depends on what you found in Step 2. Here are the most common scenarios:

Bad deploy → Roll back: If the outage started immediately after a deployment, roll back first, investigate second. A rollback takes minutes. Debugging a broken deploy while the site is down takes longer and costs more.

Server crash → Restart: If your application process crashed (OOM kill, unhandled exception), restart it. Check logs after the restart to understand why it crashed.

Resource exhaustion → Free resources: Full disk? Clear old logs and temp files. Out of memory? Restart and set up alerts. Database connections maxed? Kill idle connections and check for leaks.
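
For the full-disk case, you can confirm the diagnosis from the server itself before deleting anything. A minimal Python sketch using the standard library follows; the 90% threshold and the function names are assumptions for illustration, so pick a threshold that suits your system.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_critical(used_percent: float, threshold: float = 90.0) -> bool:
    """True when usage is at or above the (assumed) alert threshold."""
    return used_percent >= threshold
```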

Hosting/DNS issue → Contact provider: If it's your hosting provider or DNS provider, contact their support and monitor their status page. There's nothing you can fix on your end for infrastructure-level issues.

SSL certificate expired → Renew: This happens more often than anyone admits. Renew the certificate, then enable automatic renewal if your provider supports it, or at least set a calendar reminder well ahead of the next expiry date.
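
Certificate expiry is easy to check in advance. The following Python sketch reads a certificate's expiry date and reports the days remaining, using only the standard library's ssl module. The function names are invented for this example. Note that if the certificate has already expired, the TLS handshake in fetch_not_after fails with a verification error, which is itself the answer.

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float) -> float:
    """Days left, given a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to the host over TLS and return its certificate's notAfter field."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def cert_days_left(host: str) -> float:
    """Convenience wrapper: days until the host's certificate expires."""
    return days_until_expiry(fetch_not_after(host), time.time())
```

Running cert_days_left on a schedule and alerting when the result drops below a few weeks turns a surprise outage into a routine task.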

Unknown cause → Gather evidence: If nothing obvious is wrong, start collecting data: server logs, application logs, error rates, response time trends. Check if the issue is intermittent or continuous. Sometimes the best move is to restart everything and analyze logs after the service is restored.

Step 5: Verify Recovery (Don't Trust One Check)

Your site responds once. Is it actually back?

Verify properly:

Check your monitoring dashboard: wait for at least two, ideally three, consecutive successful checks, not just one.
Test the full user flow: homepage, login, core functionality, and checkout if applicable.
Check response times: the site might be "up" but taking 10 seconds to respond. That's functionally broken.
Verify from multiple locations: if you're only testing from your own machine, you might be seeing a cached version.
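
The "consecutive successes plus acceptable response time" rule can be sketched in a few lines of Python. This is an illustration under assumptions: the names (verify_recovery, http_probe), the 2-second limit, and the 30-second interval are all made up for the example, and the loop runs from a single machine, so it does not replace checks from multiple locations.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 10.0) -> float:
    """Fetch the URL and return elapsed seconds; raises on errors and 4xx/5xx."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start


def verify_recovery(
    probe: Callable[[], float],
    required_successes: int = 3,
    max_seconds: float = 2.0,
    interval: float = 30.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """True once `required_successes` fast, successful probes occur in a row."""
    streak = 0
    for attempt in range(max_attempts):
        try:
            ok = probe() <= max_seconds  # a slow response counts as a failure
        except Exception:
            ok = False
        streak = streak + 1 if ok else 0  # any failure resets the streak
        if streak >= required_successes:
            return True
        if attempt < max_attempts - 1:
            sleep(interval)
    return False
```

A call such as verify_recovery(lambda: http_probe(url)) for each key endpoint gives you a yes/no answer before you mark the incident as resolved.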

Update your communications:

Status page → "Resolved" with a brief note
Team channel → "Site is back, verified, monitoring closely"
If you posted on social media → follow up with a resolution message

Step 6: Post-Mortem (Within 24 Hours)

Don't skip this. Without it, the same outage is likely to recur, and you'll work through this checklist from scratch again.

A post-mortem doesn't need to be a formal document. Answer five questions:

What happened? One-sentence description of the incident.
When did it start and end? Use timestamps from your uptime monitor. Don't guess.
How was it detected? Monitoring alert? User report? Someone noticing by accident?
What was the fix? What specifically resolved the issue?
What will prevent this next time? This is the only question that matters long-term. Be specific — "we'll be more careful" is not a prevention measure.
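
If you want the answers in a consistent shape from one incident to the next, a small record type is enough. The sketch below is one possible layout in Python; the class and field names are invented for illustration and simply mirror the five questions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class PostMortem:
    what_happened: str      # one-sentence description
    started: datetime       # taken from the uptime monitor, not from memory
    ended: datetime
    detected_by: str        # e.g. monitoring alert, user report
    fix: str                # what specifically resolved the issue
    prevention: List[str]   # concrete follow-up actions

    @property
    def duration(self) -> timedelta:
        """Length of the outage, computed from the two timestamps."""
        return self.ended - self.started

    def render(self) -> str:
        """Plain-text summary suitable for a team channel or wiki page."""
        fmt = "%Y-%m-%d %H:%M"
        lines = [
            f"What happened: {self.what_happened}",
            f"Start / end: {self.started.strftime(fmt)} to "
            f"{self.ended.strftime(fmt)} (duration {self.duration})",
            f"Detected by: {self.detected_by}",
            f"Fix: {self.fix}",
            "Prevention:",
        ]
        lines += [f"  - {item}" for item in self.prevention]
        return "\n".join(lines)
```

Making prevention a list nudges you toward specific actions: an empty list is an obvious gap in a way that a vague sentence is not.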

Common prevention measures:

Set up a monitoring alert if you didn't have one
Add a health check endpoint that tests more than just the homepage
Tighten your check interval on critical endpoints
Add a deployment health gate (don't route traffic to the new version until it passes a health check)
Set up webhook alerts to notify your team channel automatically
Set up disk space / memory alerts with your hosting provider
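
As an example of a health check endpoint that tests more than the homepage, here is a minimal sketch built on Python's standard http.server. Everything specific in it is an assumption: the /healthz path, the function names, and the idea that each dependency check is a function that raises on failure. In a real application you would add the equivalent route to whatever web framework you already use.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple


def run_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run every check; return 200 if all pass, otherwise 503, plus details."""
    results = {}
    for name, check in checks.items():
        try:
            check()  # each check raises an exception on failure
            results[name] = "ok"
        except Exception:
            results[name] = "fail"  # keep error details out of the response
    healthy = all(value == "ok" for value in results.values())
    return (200 if healthy else 503), results


def make_handler(checks: Dict[str, Callable[[], None]]):
    """Build a request handler class that serves the checks at /healthz."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            status, results = run_checks(checks)
            body = json.dumps(results).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler
```

Pointing both your uptime monitor and your deployment health gate at the same endpoint means a deploy that breaks the database connection is caught before it takes traffic, not after.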

The Checklist (Copy This)

When your site goes down:

Confirm the outage (monitoring dashboard, mobile data, different endpoint)
Identify the type (full outage, partial, specific endpoint)
Check hosting provider status
Check for recent deploys or changes
Check third-party service status pages
Update your status page to "Investigating"
Notify your team
Apply the fix (rollback, restart, contact provider)
Verify recovery (multiple checks, full user flow, response times)
Update status page to "Resolved"
Write a post-mortem within 24 hours

Every Outage Is a Rehearsal

The first time your site goes down, it feels like an emergency. By the fifth time, it's a procedure. The teams that handle downtime well aren't the ones who never have outages — they're the ones who have a playbook and follow it.

This checklist is your playbook. Refine it after each incident. Over time, your mean time to detect drops, your mean time to resolve shrinks, and the chaos that used to consume an hour gets compressed into 10 focused minutes.