Definition
An outage is a discrete period during which a service is down or not working as intended. Where downtime is the cumulative measure, an outage is the individual incident — "the API outage on Tuesday" — with a start, a duration, and a resolution.
Outages range from total (everything is offline) to partial (one region, one feature, or one dependency is affected). Both matter, because even a partial outage can block a critical user flow like checkout or login.
Why It Matters
Outages are the events that damage revenue and trust, and how you handle them defines your reliability reputation. Detecting an outage quickly, communicating it clearly, and resolving it fast are what separate a minor blip from a crisis. Tracking outages over time also reveals recurring patterns worth fixing, such as the same endpoint or deploy step failing repeatedly.
How It Works
Monitoring detects an outage when checks start failing, records the start time, and (with alerting) notifies your team. When checks succeed again, the outage end time is recorded and the incident is resolved. The outage duration is then added to your cumulative downtime, which lowers your uptime percentage for the reporting period.
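The detection logic described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular monitoring tool implements it: the `Outage` class and `find_outages` function are hypothetical names, and the sketch assumes check results arrive as time-ordered `(timestamp, ok)` pairs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple


@dataclass
class Outage:
    start: datetime
    end: Optional[datetime] = None  # None while the outage is still ongoing

    @property
    def duration(self) -> Optional[timedelta]:
        # Duration is only known once the outage has been resolved.
        return None if self.end is None else self.end - self.start


def find_outages(checks: Iterable[Tuple[datetime, bool]]) -> List[Outage]:
    """Turn time-ordered (timestamp, ok) check results into a list of outages."""
    outages: List[Outage] = []
    current: Optional[Outage] = None
    for ts, ok in checks:
        if not ok and current is None:
            # First failing check: the outage starts here.
            current = Outage(start=ts)
            outages.append(current)
        elif ok and current is not None:
            # First passing check after failures: the outage ends here.
            current.end = ts
            current = None
    return outages
```

Consecutive failing checks extend the same outage rather than opening a new one, which is why only the transitions between passing and failing matter here.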
Real-World Example
At 9:14 AM a deploy breaks the login endpoint. Monitoring detects failing checks, opens an incident, and alerts the team. The bad deploy is rolled back and login recovers at 9:31 AM. The outage lasted 17 minutes and is logged with a full timeline.
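The arithmetic behind this example, and its effect on an uptime figure, can be worked through as follows. The calendar date and the 30-day reporting window are assumptions made for illustration, and `uptime_percent` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def uptime_percent(downtime: timedelta, window: timedelta) -> float:
    """Share of the window, in percent, during which the service was up."""
    return 100 * (1 - downtime / window)


# Times from the example; the date itself is arbitrary.
start = datetime(2024, 1, 1, 9, 14)
end = datetime(2024, 1, 1, 9, 31)
outage = end - start

window = timedelta(days=30)  # assumed reporting period

print(f"Outage duration: {outage.total_seconds() / 60:.0f} minutes")
print(f"Uptime over 30 days: {uptime_percent(outage, window):.2f}%")
```

If this were the only outage in the window, the 17 minutes of downtime would leave uptime at roughly 99.96% for the month.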
Best Practices
- Detect outages fast with frequent checks and instant alerts
- Communicate outages on a status page to reduce support load
- Record a timeline for every outage: detected, investigating, resolved
- Run a brief post-incident review to prevent repeats
- Distinguish partial outages so you understand real user impact
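One way to distinguish partial from total outages is to classify the current state from per-component check results. The sketch below assumes each component (a region, feature, or dependency) reports a single pass/fail value; `classify_outage` and the component names are hypothetical.

```python
from typing import Dict


def classify_outage(component_ok: Dict[str, bool]) -> str:
    """Return 'operational', 'partial', or 'total' from per-component results."""
    if not component_ok:
        raise ValueError("no components to classify")
    failing = [name for name, ok in component_ok.items() if not ok]
    if not failing:
        return "operational"
    if len(failing) == len(component_ok):
        return "total"
    return "partial"


# Hypothetical components: login is failing, the others are healthy.
status = classify_outage({"login": False, "checkout": True, "api": True})
print(status)
```

A count-based rule like this treats all components as equally important. In practice a failing login or checkout may deserve more weight than a minor feature, so the classification is a starting point for judging user impact rather than a full measure of it.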
Common Mistakes
- Finding out about outages from customers instead of monitoring
- Staying silent during an outage instead of posting status updates
- Not recording outage timelines, so lessons are lost
- Treating every outage as total when many are partial
- Skipping the post-incident review and repeating the same failure
In Monitoristic
When Monitoristic records failed checks, it opens an incident automatically, notifies you via Telegram and webhooks, and re-checks every 60 seconds. The incident timeline captures the start, any updates, and the resolution, and it can be shown on your public status page.