Definition
An incident is a documented event representing a service disruption, opened when a problem is detected and closed when it's resolved. It bundles together the timeline, updates, and outcome of an outage into a single record.
Where an alert is a momentary notification, an incident is the ongoing story: when it started, what was investigated, what actions were taken, and when service was restored. Incident tracking turns chaotic outages into structured, reviewable events.
Why It Matters
Incidents give outages structure and memory. They coordinate the team during a crisis, communicate status to customers, and create a record you can learn from afterward. Without incident tracking, the same failures tend to recur, and the details that post-mortems and SLA claims depend on are lost.
How It Works
When monitoring detects a failure (often only once the failure has been confirmed), an incident is created automatically with a start time. As responders work, they add updates, and the status page can reflect the incident. When checks pass again, the incident is resolved with an end time, producing a complete timeline that feeds downtime and MTTR (mean time to recovery) calculations and post-incident reviews.
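The lifecycle described above can be sketched as a small record type. This is a minimal illustration, not any particular tool's data model: the `Incident` class, its field names, and the sample update time are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Incident:
    """Hypothetical incident record; names are illustrative only."""
    monitor: str
    started_at: datetime
    resolved_at: Optional[datetime] = None
    # Each update is a (timestamp, message) pair, kept in the order posted.
    updates: List[Tuple[datetime, str]] = field(default_factory=list)

    @property
    def is_open(self) -> bool:
        return self.resolved_at is None

    def add_update(self, at: datetime, message: str) -> None:
        self.updates.append((at, message))

    def resolve(self, at: datetime) -> None:
        # Refuse an end time earlier than the start time.
        if at < self.started_at:
            raise ValueError("end time precedes start time")
        self.resolved_at = at

    def duration(self) -> Optional[timedelta]:
        # An open incident has no final duration yet.
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at


# Usage: the date and the update time are made up for illustration.
incident = Incident(monitor="checkout", started_at=datetime(2024, 1, 1, 10, 2))
incident.add_update(datetime(2024, 1, 1, 10, 5), "investigating")
incident.resolve(datetime(2024, 1, 1, 10, 19))
print(incident.duration())
```

Keeping the start time, end time, and updates on one record is what lets downtime and review material be derived later without reconstructing them from chat logs.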
Real-World Example
Checkout starts failing at 10:02. An incident opens automatically and the team is alerted. They post updates ("investigating," then "rolling back a deploy"), and checkout recovers at 10:19, 17 minutes after the failure began. The incident auto-resolves with a full timeline that the team reviews the next day.
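As a quick worked check, the downtime for this example follows directly from the two recorded timestamps. The calendar date below is an arbitrary placeholder; only the clock times come from the example.

```python
from datetime import datetime

day = (2024, 1, 1)  # placeholder date, not part of the example
started_at = datetime(*day, 10, 2)    # checkout starts failing
resolved_at = datetime(*day, 10, 19)  # checkout recovers

downtime = resolved_at - started_at
downtime_minutes = downtime.total_seconds() / 60
print(f"Downtime: {downtime_minutes:.0f} minutes")  # prints "Downtime: 17 minutes"
```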
Best Practices
- Open incidents automatically on confirmed failures
- Keep a clear timeline: detected, investigating, identified, resolved
- Communicate incidents on a status page to reduce support load
- Resolve incidents promptly and record the end time accurately
- Run brief post-incident reviews to prevent repeats
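One common way to define a "confirmed" failure is to require several consecutive failed checks before opening an incident. The sketch below assumes that rule; the function name `first_confirmed_failure` and the default threshold of 3 are hypothetical, and a real monitor may confirm failures differently (for example, by re-checking from another location).

```python
from typing import Iterable, Optional


def first_confirmed_failure(results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the check that confirms a failure, or None.

    `results` holds one boolean per check (True means the check passed).
    A failure counts as confirmed after `threshold` failed checks in a row.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    streak = 0
    for index, ok in enumerate(results):
        # A passing check resets the streak; a failing one extends it.
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return index
    return None


checks = [True, False, True, False, False, False]
print(first_confirmed_failure(checks))  # prints 5
```

The single failure at index 1 does not open an incident, which is the point of confirmation: it filters out one-off blips. Whether the incident's start time should be the first failed check in the streak or the confirming check is a separate design choice.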
Common Mistakes
- Handling outages ad hoc with no incident record
- Failing to communicate during an active incident
- Leaving incidents open after the issue is resolved
- Skipping the post-mortem and repeating the same failure
- Losing timestamps needed for MTTR and SLA evidence
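Accurate timestamps matter because metrics such as MTTR are computed directly from them. A minimal sketch, taking MTTR here as the mean time from incident start to resolution (definitions vary between teams). The function name and the second incident in the sample data are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_recovery(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average (resolved_at - started_at) over resolved incidents."""
    if not incidents:
        raise ValueError("at least one resolved incident is required")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Sample data: dates and the second incident are made up for illustration.
resolved_incidents = [
    (datetime(2024, 1, 1, 10, 2), datetime(2024, 1, 1, 10, 19)),  # 17 minutes
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 43)),  # 43 minutes
]
print(mean_time_to_recovery(resolved_incidents))  # prints 0:30:00
```

If either timestamp of an incident is missing or wrong, the average is skewed, which is why lost or inaccurate end times undermine both MTTR and SLA evidence.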
In Monitoristic
Monitoristic creates an incident automatically when a monitor goes down and resolves it when the monitor recovers, with a full timeline that is visible to you and your team and can be shown on your public status page.