Definition
Error rate is the proportion of requests (or monitoring checks) that result in an error instead of a successful response, usually expressed as a percentage over a period. A 2% error rate means 2 out of every 100 requests failed.
Errors include unexpected status codes (like 500 or 503), timeouts, and connection failures. Unlike a binary up/down view, error rate captures partial degradation: a service can be "up" while quietly failing a meaningful share of its requests.
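To make "what counts as an error" concrete, here is a minimal sketch of a classifier. The `CheckResult` record and the `is_error` function are hypothetical names chosen for illustration, and the default expected status of 200 is an assumption; they are not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class CheckResult:
    """One request or monitoring check (hypothetical record shape)."""
    status: Optional[int] = None   # HTTP status, or None if no response arrived
    timed_out: bool = False
    connect_failed: bool = False


def is_error(result: CheckResult,
             expected: FrozenSet[int] = frozenset({200})) -> bool:
    """Return True if the result counts as an error under this sketch's rules."""
    # Timeouts and connection failures are errors regardless of status.
    if result.timed_out or result.connect_failed:
        return True
    # Any status outside the expected set (including a missing one) is an error.
    return result.status not in expected


print(is_error(CheckResult(status=503)))        # True
print(is_error(CheckResult(status=200)))        # False
print(is_error(CheckResult(timed_out=True)))    # True
```

Passing a wider `expected` set, for example `frozenset({200, 404})`, is one way to keep legitimate non-2xx responses out of the error count.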
Why It Matters
Error rate often reveals problems before they become full outages. A slowly rising error rate can point to an overloaded database, a flaky dependency, or a bad deploy that affects only some users. Watching it lets you act on degradation early, while it is still a warning rather than a crisis.
How It Works
Error rate = (failed requests ÷ total requests) × 100, measured over a fixed time window. For monitoring, each check is a sample: the share of checks that returned an unexpected status, timed out, or failed to connect is your monitored error rate. Setting an error-rate threshold lets you alert on degradation, not just total failure.
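The formula and the threshold idea can be sketched as a small rolling window over check outcomes. The class name `RollingErrorRate`, the sample-count window, and the threshold values are assumptions made for illustration; a real system might window by time instead of by sample count.

```python
from collections import deque


class RollingErrorRate:
    """Tracks the error rate over the most recent `window_size` samples."""

    def __init__(self, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        # A bounded deque drops the oldest sample once the window is full.
        self.samples = deque(maxlen=window_size)

    def record(self, failed: bool) -> None:
        """Add one request or check outcome to the window."""
        self.samples.append(failed)

    def percent(self) -> float:
        """(failed / total) * 100 over the current window; 0.0 if empty."""
        if not self.samples:
            return 0.0
        return 100 * sum(self.samples) / len(self.samples)

    def exceeds(self, threshold_percent: float) -> bool:
        """True when the windowed error rate is above the alert threshold."""
        return self.percent() > threshold_percent


window = RollingErrorRate(window_size=4)
for failed in (True, False, False, False):
    window.record(failed)
print(window.percent())      # 25.0
print(window.exceeds(5.0))   # True
```

A window that is too small makes the rate jumpy, while one that is too large delays alerts, so the window size is a tuning choice rather than a fixed rule.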
Real-World Example
An API handles 10,000 requests in an hour; 150 return HTTP 500. The error rate is 1.5%. It normally sits near 0.1%, so the 15-fold jump prompts an investigation, which finds a dependency timing out under load. The problem is caught well before it could cause a full outage.
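The arithmetic in this example can be checked with a few lines. The function names and the baseline multiplier of 5 are illustrative assumptions, not a recommended setting.

```python
def error_rate_percent(failed: int, total: int) -> float:
    """Return the error rate as a percentage of total requests."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    return 100 * failed / total


def is_spike(rate: float, baseline: float, multiplier: float = 5.0) -> bool:
    """Flag a rate that is more than `multiplier` times the usual baseline."""
    return rate > baseline * multiplier


rate = error_rate_percent(failed=150, total=10_000)
print(rate)                           # 1.5
print(is_spike(rate, baseline=0.1))   # True
```

Comparing against a baseline rather than a fixed number is what lets a 1.5% rate stand out for a service that normally runs near 0.1%.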
Best Practices
- Alert on error-rate thresholds, not only on total downtime
- Define clearly which responses count as errors
- Watch error-rate trends to catch slow degradation
- Segment error rate by endpoint to localize problems
- Correlate error-rate spikes with deploys and traffic changes
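As a rough illustration of segmenting by endpoint, the sketch below groups outcomes per endpoint before computing each rate. The function name, the `(endpoint, failed)` record shape, and the endpoint paths are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rate_by_endpoint(
    records: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Return the error-rate percentage for each endpoint.

    `records` is an iterable of (endpoint, failed) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for endpoint, failed in records:
        totals[endpoint] += 1
        if failed:
            failures[endpoint] += 1
    return {ep: 100 * failures[ep] / totals[ep] for ep in totals}


records = (
    [("/login", False)] * 9
    + [("/login", True)]
    + [("/search", False)] * 10
)
print(error_rate_by_endpoint(records))
# {'/login': 10.0, '/search': 0.0}
```

Across all 20 requests the aggregate rate here is 5%, which understates the 10% failure rate on `/login` and hides that `/search` is healthy.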
Common Mistakes
- Only monitoring up/down and missing partial failures
- Counting expected non-2xx responses (like 404s) as errors
- Ignoring small but rising error rates until they become outages
- Aggregating all endpoints so a localized problem is hidden
- Setting no baseline, so you can't tell normal from abnormal
In Monitoristic
Monitoristic records each check as a success or failure against your expected status code, so the share of failing checks is effectively your monitored error rate. Watch it move over time to spot degradation before it becomes a full outage.