Definition
MTTR (Mean Time to Recovery, sometimes Repair or Resolution) is the average duration from when an outage starts to when the service is fully restored, across multiple incidents. It measures how quickly you bounce back from failure rather than how often failure happens.
MTTR includes detection, diagnosis, and fix time. A low MTTR means problems are caught and resolved fast; a high MTTR means outages drag on, compounding their impact even if they're rare.
Why It Matters
Failures are inevitable, so how fast you recover often matters more than how often you fail. MTTR directly affects total downtime and customer impact: cutting MTTR in half halves the damage from every future incident. It's also a clear signal of how mature your incident response is.
How It Works
MTTR = total recovery time across incidents ÷ number of incidents. For each outage you record the start (when failure began) and the end (when service was restored); the average of those durations is your MTTR. Faster detection (frequent monitoring + instant alerts) is usually the cheapest way to reduce it.
Real-World Example
Over a quarter a team has four outages lasting 30, 12, 20, and 18 minutes. MTTR = (30 + 12 + 20 + 18) ÷ 4 = 20 minutes. After adding 1-minute monitoring and Telegram alerts, detection time drops and the next quarter's MTTR falls to 11 minutes.
Best Practices
- Reduce detection time first — it's often the biggest, cheapest MTTR win
- Use instant alerts so the right person knows immediately
- Keep runbooks for common failures to speed diagnosis
- Track MTTR over time to confirm incident response is improving
- Record precise start and end timestamps for every incident
Common Mistakes
- Measuring MTTR only from when a human noticed, ignoring detection delay
- Not recording incident timestamps, making MTTR impossible to compute
- Focusing only on reducing failure frequency and ignoring recovery speed
- Lacking runbooks, so every incident is diagnosed from scratch
- Averaging too few incidents to draw meaningful conclusions
In Monitoristic
Monitoristic timestamps when an incident opens and when it resolves, so you can calculate MTTR straight from the incident timeline. Fast detection (checks plus instant Telegram/webhook alerts) helps keep the recovery clock short.