What Is MTTR (Mean Time to Recovery)? — Uptime Monitoring Glossary

Definition

MTTR (Mean Time to Recovery, sometimes Repair or Resolution) is the average duration from when an outage starts to when the service is fully restored, across multiple incidents. It measures how quickly you bounce back from failure rather than how often failure happens.

MTTR includes detection, diagnosis, and fix time. A low MTTR means problems are caught and resolved fast; a high MTTR means outages drag on, compounding their impact even if they're rare.

Why It Matters

Failures are inevitable, so how fast you recover often matters more than how often you fail. MTTR directly affects total downtime and customer impact: with the same number of incidents, halving MTTR roughly halves total downtime. It's also a clear signal of how mature your incident response is.

How It Works

MTTR = total recovery time across incidents ÷ number of incidents. For each outage you record the start (when failure began) and the end (when service was restored); the average of those durations is your MTTR. Faster detection (frequent monitoring + instant alerts) is usually the cheapest way to reduce it.

Real-World Example

Over a quarter a team has four outages lasting 30, 12, 20, and 18 minutes. MTTR = (30 + 12 + 20 + 18) ÷ 4 = 20 minutes. After adding 1-minute monitoring and Telegram alerts, detection time drops and the next quarter's MTTR falls to 11 minutes.
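
The first-quarter calculation can be sketched in Python. This is a minimal illustration, not part of any particular tool: the function name mean_time_to_recovery and the timestamps are hypothetical, chosen only so the four outages last 30, 12, 20, and 18 minutes.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average outage duration for a list of (start, end) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual outage durations, then divide by the incident count.
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Hypothetical timestamps; only the durations come from the example above.
incidents = [
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 30)),     # 30 min
    (datetime(2024, 2, 2, 3, 15), datetime(2024, 2, 2, 3, 27)),      # 12 min
    (datetime(2024, 2, 20, 11, 40), datetime(2024, 2, 20, 12, 0)),   # 20 min
    (datetime(2024, 3, 14, 22, 5), datetime(2024, 3, 14, 22, 23)),   # 18 min
]

mttr = mean_time_to_recovery(incidents)
print(mttr.total_seconds() / 60)  # 20.0
```

Working from timestamps rather than hand-entered durations keeps the figure reproducible and makes it easy to recompute as new incidents are added.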

Best Practices

Reduce detection time first — it's often the biggest, cheapest MTTR win
Use instant alerts so the right person knows immediately
Keep runbooks for common failures to speed diagnosis
Track MTTR over time to confirm incident response is improving
Record precise start and end timestamps for every incident

Common Mistakes

Measuring MTTR only from when a human noticed, ignoring detection delay
Not recording incident timestamps, making MTTR impossible to compute
Focusing only on reducing failure frequency and ignoring recovery speed
Lacking runbooks, so every incident is diagnosed from scratch
Averaging too few incidents to draw meaningful conclusions
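
The first mistake in the list is easy to see with numbers. The sketch below assumes each incident record carries three timestamps; the Incident class, its field names, and the data are all hypothetical. It compares MTTR measured from the true start of the outage with MTTR measured only from the moment of detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime   # when the failure actually began
    detected_at: datetime  # when monitoring or a human noticed it
    resolved_at: datetime  # when service was fully restored


def mean(deltas):
    """Average of an iterable of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


# Hypothetical incidents for illustration only.
incidents = [
    Incident(datetime(2024, 4, 1, 9, 0), datetime(2024, 4, 1, 9, 10),
             datetime(2024, 4, 1, 9, 30)),
    Incident(datetime(2024, 5, 6, 17, 0), datetime(2024, 5, 6, 17, 4),
             datetime(2024, 5, 6, 17, 14)),
]

mttr_true = mean(i.resolved_at - i.started_at for i in incidents)
mttr_from_detection = mean(i.resolved_at - i.detected_at for i in incidents)
detection_delay = mean(i.detected_at - i.started_at for i in incidents)

print(mttr_true)            # 0:22:00
print(mttr_from_detection)  # 0:15:00
print(detection_delay)      # 0:07:00
```

In this made-up data, starting the clock at detection hides 7 of the 22 minutes, which is exactly the part that faster monitoring would shrink.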

In Monitoristic

Monitoristic timestamps when an incident opens and when it resolves, so you can calculate MTTR straight from the incident timeline. Fast detection (checks plus instant Telegram/webhook alerts) helps keep the recovery clock short.

Frequently Asked Questions

What does MTTR stand for?

Most commonly Mean Time to Recovery (or Repair/Resolution) — the average time to restore service after an outage starts.

What is a good MTTR?

It varies by service, but lower is better. Strong teams keep MTTR to minutes by detecting failures fast and following clear runbooks.

How is MTTR different from MTBF?

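MTTR measures how quickly you recover from failures; MTBF (Mean Time Between Failures) measures how long the service runs between failures. Together they determine availability: the longer the service runs between failures and the faster it recovers, the larger the share of time it is up.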
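
One common way to combine the two is the steady-state availability formula, MTBF ÷ (MTBF + MTTR). The sketch below assumes MTBF is measured as mean uptime between failures (some definitions measure from the start of one failure to the start of the next, which changes the formula); the function name and the sample figures are hypothetical.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Fraction of time the service is up: MTBF / (MTBF + MTTR).

    Assumes MTBF is mean uptime between failures, in the same unit as MTTR.
    """
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    if mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails on average every 99 hours of uptime.
before = steady_state_availability(mtbf_hours=99, mttr_hours=1.0)
after = steady_state_availability(mtbf_hours=99, mttr_hours=0.5)

print(f"{before:.4f}")  # 0.9900
print(after > before)   # True
```

Holding MTBF fixed and halving MTTR raises availability without making failures any rarer, which is why the two metrics are tracked together.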

How do I reduce MTTR?

Detect faster with frequent monitoring and instant alerts, keep runbooks for common issues, and rehearse incident response.