Monitoring That Doesn't Lie: Prometheus, Grafana and the Art of Useful Alerts
Making observability produce signal instead of noise. SLO-based alerting with error-budget burn rates, the four-level alert hierarchy (page / ticket / notice / log), Alertmanager inhibit rules, dashboard discipline, and the post-mortem-to-runbook pipeline that compounds operational quality.
The dashboard is full of red. The on-call engineer's pager has been going off for an hour. Everyone is in a war room. Nothing is actually broken: CPU briefly hit 87% on an instance that recovered before anyone looked. This is the most expensive failure mode in modern observability: a monitoring system that lies.
This article is the practitioner's guide to building observability that produces signal, not noise. It takes a different angle from our €200/mo SMB monitoring stack article: that one covers what to deploy, this one covers how to make it useful. It is aimed at the IT lead who has Prometheus + Grafana running but is drowning in alerts nobody acts on.
The two failure modes of bad monitoring
Failure mode 01: the noisy stack (pages at 3am for nothing)
Symptoms:
- 5-15 alerts per day per on-call engineer
- Engineers triage the alert, find nothing wrong, close it, move on
- Over time, engineers stop reading alert details — they auto-close based on the alert name
- The real incident hides in the noise; the engineer scrolls past it
This is the more common failure. Most monitoring stacks start here and stay here until someone forces a tuning sprint.
Failure mode 02: the silent stack (real outages pass undetected)
Symptoms:
- The website was down for 45 minutes; customers reported it via Twitter; the team had no alert
- The database was at 99% disk for 2 days; nobody noticed until it filled
- Backups failed for 3 weeks; only discovered during a quarterly drill
- An auto-scaler was misconfigured; the team paid 4x normal cloud bill until the invoice landed
This failure is less common but more damaging per incident.
The four-level alert hierarchy that works
The Google SRE book introduced "symptom-based alerting" as the alternative to threshold-based alerting. The principle: alert on what the user experiences, not on the underlying causes that may or may not affect the user.
| Level | Definition | Routing | Acceptable rate |
|---|---|---|---|
| L1 Page | Customer-impacting incident requiring immediate human action | PagerDuty / phone | < 3 / week / engineer |
| L2 Ticket | Action needed within hours but not minutes | Slack + ticket queue | < 5 / day |
| L3 Notice | Trend worth knowing about, no immediate action | Slack channel only | Unbounded |
| L4 Logged | Recorded for post-hoc analysis | Log / dashboard only | Unbounded |
Every alert in the stack has a level. The level determines the routing. Engineers know what to do at each level. The signal-to-noise ratio is explicitly designed, not accidentally produced.
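One way to keep the hierarchy honest is to encode it as data and check each level against its acceptable rate during a weekly review. The sketch below mirrors the table above; the function name `over_budget`, the conversion of "5 per day" to 35 per week, and the sample counts are assumptions made for illustration.

```python
# Acceptable weekly rates per level, taken from the table above.
# None means the level is unbounded. The ticket limit of 5/day is
# expressed as 35/week (an assumption about how to compare the two).
LEVELS = {
    "page":   {"routing": "PagerDuty / phone",    "max_per_week": 3},
    "ticket": {"routing": "Slack + ticket queue", "max_per_week": 5 * 7},
    "notice": {"routing": "Slack channel only",   "max_per_week": None},
    "log":    {"routing": "Log / dashboard only", "max_per_week": None},
}


def over_budget(weekly_counts):
    """Return the levels whose weekly alert count reaches or exceeds the limit."""
    flagged = []
    for level, count in weekly_counts.items():
        limit = LEVELS[level]["max_per_week"]
        if limit is not None and count >= limit:
            flagged.append(level)
    return flagged


# Hypothetical week for one engineer: 7 pages is too many, 12 tickets is fine.
sample_week = {"page": 7, "ticket": 12, "notice": 140, "log": 9000}
print(over_budget(sample_week))
```

A level that shows up in the result is a candidate for the next tuning sprint: either the alerts at that level are too sensitive, or they are routed one level too high.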
Symptom-based alerting in practice
The wrong way (cause-based)
# BAD: alerts on causes that may or may not produce impact
- alert: HighCPU
  expr: cpu_usage > 80
- alert: HighMemory
  expr: memory_usage > 85
- alert: DiskWarn
  expr: disk_used > 70
- alert: NetworkLatency
  expr: latency_p95 > 100ms
These produce noise. A brief spike to 81% CPU is fine. Memory at 86% with no swap activity is fine. Disk at 71% with slow, steady growth is fine. P95 latency of 102ms on a non-customer-facing endpoint is fine.
The right way (symptom-based)
# GOOD: alerts on what users actually experience
- alert: CustomerErrorRateElevated
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
    / sum(rate(http_requests_total[5m])) by (service) > 0.01
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.service }} 5xx error rate above 1%"
- alert: CustomerLatencyP95Degraded
  expr: |
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
    > 2 * histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[7d])) by (service, le))
  for: 10m
  labels:
    severity: page
These page for real impact: customers seeing errors or experiencing degraded latency. The underlying cause (CPU, memory, disk, network) is investigated after the alert fires; the alert itself is grounded in user experience.
Service Level Objectives (the real signal)
Symptom-based alerting works best when paired with explicit SLOs. The pattern:
- Define the SLO. "99.5% of API requests succeed within 2 seconds." "99.9% of database transactions complete within 100ms."
- Calculate the error budget. 0.5% of requests can fail before the SLO is missed.
- Alert on burn rate. If errors are arriving fast enough to exhaust the monthly budget in about 2 days (a 14.4x burn rate), page. If fast enough to exhaust it in about 5 days (a 6x burn rate), ticket.
- Display burn rate. Dashboards show "we have 73% of this month's budget remaining" — not just "things are red right now".
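The "budget remaining" figure in the last step is simple arithmetic over request counts. A minimal sketch, assuming the budget is measured against the requests served so far in the period; the function name and the sample numbers are hypothetical.

```python
def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent for the period so far.

    slo is the success target as a fraction, e.g. 0.995 for 99.5%.
    """
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures


# Hypothetical month to date: 2,000,000 requests, 2,700 of them failed.
# A 99.5% SLO allows 10,000 failures at this volume.
remaining = budget_remaining(0.995, 2_000_000, 2_700)
print(f"{remaining:.0%} of the error budget remaining")
```

A negative result means the budget is already overspent, which is worth showing on the dashboard as-is rather than clamping to zero.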
The SLO alert rule
- alert: APIBurningErrorBudgetFast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
      / sum(rate(http_requests_total[1h])) by (service)
    ) > (14.4 * 0.005)  # 14.4x burn: 2% of the 30-day budget spent in 1h, all of it in ~50h
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.service }} burning error budget fast"
    description: "At current burn rate, monthly error budget exhausts in about 2 days"
- alert: APIBurningErrorBudgetSlow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
      / sum(rate(http_requests_total[6h])) by (service)
    ) > (6 * 0.005)  # 6x burn: 5% of the 30-day budget spent in 6h, all of it in 5 days
  for: 15m
  labels:
    severity: ticket
  annotations:
    summary: "{{ $labels.service }} on track to exhaust error budget in about 5 days"
The math: a 99.5% SLO leaves a 0.5% error budget, which over a 30-day month equals 216 minutes (3.6 hours) of total outage. A 14.4x burn rate spends 2% of that budget in one hour and exhausts it in 50 hours. A 6x burn rate spends 5% of it in six hours and exhausts it in 5 days. The two-tier approach catches both fast-and-acute and slow-and-chronic problems.
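The burn-rate arithmetic can be checked directly. The sketch below assumes a 30-day SLO window and a constant burn rate; the function names are illustrative and not part of any Prometheus API.

```python
PERIOD_HOURS = 30 * 24  # assume a 30-day SLO window
SLO = 0.995             # 99.5% success target


def budget_minutes(slo, period_hours=PERIOD_HOURS):
    """Minutes of total outage that would consume the whole budget."""
    return (1 - slo) * period_hours * 60


def hours_to_exhaustion(burn_rate, period_hours=PERIOD_HOURS):
    """Hours until the budget is gone at a constant burn rate."""
    return period_hours / burn_rate


def budget_spent(burn_rate, window_hours, period_hours=PERIOD_HOURS):
    """Fraction of the budget consumed during one alert window."""
    return burn_rate * window_hours / period_hours


print(f"budget: {budget_minutes(SLO):.0f} minutes per 30 days")
for rate, window in [(14.4, 1), (6, 6)]:
    print(
        f"{rate}x burn: {budget_spent(rate, window):.0%} of budget per "
        f"{window}h window, exhausted in {hours_to_exhaustion(rate):.0f}h"
    )
```

A burn rate of 1x spends the budget exactly over the 30 days. The 14.4x and 6x thresholds correspond to 2% of the budget in one hour and 5% in six hours, which is why they pair with the 1h and 6h rate windows in the rules.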
The five SLO categories that cover most workloads
| Category | Example metric | Typical SLO |
|---|---|---|
| Availability | HTTP success rate | 99.5% - 99.95% |
| Latency | P95 response time | < 500ms for user-facing; < 2s for batch |
| Quality | Successful task completion rate | 99.0% - 99.9% |
| Throughput | Requests handled per second under load | > specified peak capacity |
| Freshness | Maximum data age in derived datasets | < 15min for analytics; < 1min for transactional |
Start with availability + latency for every customer-facing service. Add quality + freshness where the data pipeline matters. Throughput SLOs are advanced; tackle after the others are working.
The "useful alert" checklist
Every alert that fires should pass these tests:
- Actionable. The engineer knows what to do next when this fires.
- Documented runbook. The alert links to a runbook page describing the response.
- Recent. The alert was last reviewed within 90 days.
- Owned. A named team is responsible for this alert's lifecycle.
- Tested. The alert has fired in production within the last quarter, demonstrating it works.
- Right level. If the alert routes to PagerDuty, the response genuinely requires immediate human action.
The quarterly review: walk every alert. Any alert that fails 2+ of these checks gets retired or downgraded. The discipline keeps the alert library honest.
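The quarterly walk can be partly mechanised if each alert's metadata is kept in a small inventory. A minimal sketch, assuming a hypothetical inventory format (the field names, the runbook URL and the 90-day reading of "last quarter" are all assumptions); the two judgement calls, "actionable" and "needs immediate action", are recorded by a human reviewer.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # "reviewed within 90 days", "fired within the last quarter"


def failed_checks(alert, today):
    """Return the names of the checklist items this alert fails."""
    failures = []
    if not alert.get("actionable"):
        failures.append("actionable")
    if not alert.get("runbook_url"):
        failures.append("documented runbook")
    if today - alert["last_reviewed"] > MAX_AGE:
        failures.append("recent")
    if not alert.get("owner"):
        failures.append("owned")
    if today - alert["last_fired"] > MAX_AGE:
        failures.append("tested")
    if alert["severity"] == "page" and not alert.get("needs_immediate_action"):
        failures.append("right level")
    return failures


def review(alerts, today):
    """Names of alerts failing two or more checks: retire or downgrade them."""
    return [a["name"] for a in alerts if len(failed_checks(a, today)) >= 2]


today = date(2025, 6, 30)
alerts = [
    {"name": "CustomerErrorRateElevated", "severity": "page", "actionable": True,
     "needs_immediate_action": True, "owner": "platform",
     "runbook_url": "https://wiki.example.com/runbooks/error-rate",
     "last_reviewed": date(2025, 5, 12), "last_fired": date(2025, 6, 2)},
    {"name": "HighCPU", "severity": "page", "actionable": False,
     "needs_immediate_action": False, "owner": "platform", "runbook_url": "",
     "last_reviewed": date(2024, 11, 3), "last_fired": date(2025, 6, 28)},
]
print(review(alerts, today))
```

In this sample the cause-based `HighCPU` alert fails four checks and is flagged, while the symptom-based alert passes all six.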
The on-call experience that works
Symptoms of a healthy on-call rotation:
- Fewer than 3 paging events per engineer per week
- Fewer than 1 paging event per engineer per night
- Every page produces a written follow-up (post-mortem for incidents, runbook update for noise)
- Engineers volunteer for additional shifts to cover colleagues' leave
- Engineers rarely rotate off on-call because of burnout
Symptoms of a broken on-call rotation:
- 10+ pages per engineer per week
- Multiple wake-up pages per night during shift
- Engineers go to great lengths to avoid being on-call
- "On-call recovery" days off after each shift
- Engineers leave the team citing on-call quality
The transition from broken to healthy is a 6-9 month project. The investment is the alert-tuning work described in this article + the operational discipline of post-mortem follow-up.
The post-mortem-to-runbook pipeline
Every incident produces a post-mortem. Every post-mortem produces at least one of:
- A new alert (so the next instance is caught earlier)
- A runbook update (so the next responder has better context)
- An alert tuning change (so the original alert is more useful)
- A code change (to remove the underlying cause)
The pipeline is the operational discipline that turns incidents into permanent improvements. Without it, the team has the same outage twice in 18 months.
The dashboard discipline (the under-discussed layer)
Bad dashboards are as harmful as bad alerts. Three patterns to avoid:
The "wall of green" dashboard
50 panels, all green, no signal. The engineer cannot tell which panels matter. The "everything is fine" indicator becomes "everything is the same colour".
Fix: ruthless panel reduction. The top-level service dashboard has 4-6 panels max, each representing a single SLO. Detail dashboards drill down from there.
The "log-scale everything" dashboard
Y-axes default to log scale. Subtle changes become invisible: on a log axis a 2x increase in errors is the same small vertical step at any baseline, so it reads as a minor wobble.
Fix: linear scales by default. Log scale only where the dynamic range truly demands it (network throughput across burst vs idle).
The "missing context" dashboard
Panels showing current value with no historical reference. "P95 latency is 340ms" — is that good or bad? Without the historical baseline, the panel is useless.
Fix: every panel shows current + 7-day rolling baseline + threshold annotations.
The configuration that produces useful alerts
Concrete Prometheus rules we ship to clients:
groups:
  - name: slo-burn-rate
    rules:
      - alert: APIErrorBudgetFastBurn
        expr: |
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
              / sum(rate(http_requests_total[1h])) by (service))
            > (14.4 * (1 - 0.995))
          )
          AND
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service))
            > (14.4 * (1 - 0.995))
          )
        for: 2m
        labels:
          severity: page
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Host {{ $labels.instance }} unreachable for 5 minutes"
      - alert: DiskWillFillSoon
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "Disk on {{ $labels.instance }} will fill in < 24h based on trend"
      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14*24*3600
        for: 1h
        labels:
          severity: ticket
  - name: synthetic
    rules:
      - alert: ProductionEndpointFailing
        expr: probe_success{job="blackbox-http", target="production"} == 0
        for: 3m
        labels:
          severity: page
The rules above produce roughly 3-8 pages per week per service in a healthy state. The level distribution is the design: pages for real customer impact, tickets for trend warnings, and notice-level rules (not shown here) for awareness.
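The `DiskWillFillSoon` rule relies on `predict_linear`, which fits a straight line to the samples in the range and extrapolates it forward. The sketch below is a simplified model of that idea using ordinary least squares, not Prometheus's implementation; the function name `linear_forecast` and the sample data are hypothetical.

```python
def linear_forecast(samples, eval_time, horizon_seconds):
    """Fit a least-squares line through (timestamp, value) samples and
    return its value horizon_seconds after eval_time."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + horizon_seconds)


GIB = 2**30
# Hypothetical 6h window, one sample per hour: free space falls from
# 20 GiB to 14 GiB, i.e. 1 GiB per hour.
samples = [(hour * 3600, (20 - hour) * GIB) for hour in range(7)]
forecast = linear_forecast(samples, eval_time=6 * 3600, horizon_seconds=24 * 3600)

print(f"forecast free space in 24h: {forecast / GIB:.1f} GiB")
print("DiskWillFillSoon condition met:", forecast < 0)
```

With 14 GiB free and a steady loss of 1 GiB per hour, the line crosses zero well inside 24 hours, so the condition holds. A disk sitting flat at 90% full would not trigger it, which is the point of alerting on trend rather than threshold.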
The Alertmanager routing that survives 02:00
route:
  receiver: 'default'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Page only goes to PagerDuty + on-call rotation
    - matchers:
        - severity = page
      receiver: 'pagerduty'
      repeat_interval: 1h
    # Ticket goes to Slack + ticket queue
    - matchers:
        - severity = ticket
      receiver: 'slack-tickets'
      repeat_interval: 12h
      group_interval: 30m
    # Notice goes only to Slack
    - matchers:
        - severity = notice
      receiver: 'slack-notices'
      repeat_interval: 24h
      group_interval: 4h
inhibit_rules:
  # If host is down, do not also alert on the host's specific services
  - source_matchers:
      - alertname = HostDown
    target_matchers:
      - severity =~ "page|ticket"
    equal: [instance]
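The inhibit rule is the under-used feature. When a host goes down, every service on that host alerts. The inhibit rule keeps the upstream alert (host down) and suppresses the downstream ones (services on that host), provided they carry the same instance label as HostDown; alerts aggregated by service alone have no instance label and will not be suppressed by this rule. The on-call engineer wakes up to one page, not fifteen.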
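To see which alerts a rule like this would mute, it helps to model the matching logic on a handful of sample alerts. The sketch below is a simplified model of a single inhibit rule, not Alertmanager's implementation: it treats a missing label as an empty string and never lets an alert inhibit itself. The `ServiceUnreachable` alert name and the instance names are hypothetical.

```python
import re


def matches(alert, matchers):
    """True if every (label, regex) pair fully matches the alert's label value.

    A missing label is treated as an empty string.
    """
    return all(
        re.fullmatch(regex, alert.get(label, "")) is not None
        for label, regex in matchers
    )


def inhibited(target, firing, source_matchers, target_matchers, equal):
    """Simplified model of one inhibit rule: the target is muted when some
    other firing alert matches the source side and has the same value for
    every label listed in equal."""
    if not matches(target, target_matchers):
        return False
    for source in firing:
        if source is target or not matches(source, source_matchers):
            continue
        if all(source.get(label, "") == target.get(label, "") for label in equal):
            return True
    return False


RULE = {
    "source_matchers": [("alertname", "HostDown")],
    "target_matchers": [("severity", "page|ticket")],
    "equal": ["instance"],
}

firing = [
    {"alertname": "HostDown", "severity": "page", "instance": "web-1"},
    {"alertname": "ServiceUnreachable", "severity": "page", "instance": "web-1"},
    {"alertname": "ServiceUnreachable", "severity": "page", "instance": "web-2"},
    {"alertname": "CustomerErrorRateElevated", "severity": "page", "service": "api"},
]

delivered = [a for a in firing if not inhibited(a, firing, **RULE)]
for alert in delivered:
    print(alert["alertname"], alert.get("instance") or alert.get("service"))
```

Only the `web-1` service alert is muted. The `web-2` alert is on a healthy host, and the service-level error-rate alert has no `instance` label to compare, so both are still delivered.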
The metric-noise patterns to drop
Common noise sources in default metric collections:
- Per-CPU core metrics: aggregate to per-host. The detail rarely matters; the volume is significant.
- Per-interface kernel counters: drop the loopback + docker0 + virtual interface metrics that produce volume with no signal.
- Process-level metrics: keep for critical processes; drop the long tail.
- Per-temp-table queries: in databases, exclude the noise of session-scoped temporary tables.
- Internal health-check requests: tag and exclude from RED metrics; otherwise they dilute real traffic measurements.
Aggressive metric-noise pruning reduces Prometheus storage cost + improves query performance + makes dashboards cleaner. The discipline is to drop metrics, not just add them.
The annual measurement we track
Two metrics for the monitoring stack itself:
- Mean time to detect (MTTD): from incident start to alert firing. Healthy: < 5 minutes for customer-impacting events.
- Alert precision: (true-positive alerts) / (total alerts). Healthy: > 80%.
Both numbers are reviewed quarterly with the engineering team, and both drive the tuning sprints when they drift.
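Both metrics can be computed from records most teams already keep: a list of alerts marked true or false positive during triage, and a list of incidents with start and detection times. A minimal sketch with hypothetical record formats and sample data:

```python
from datetime import datetime
from statistics import mean


def alert_precision(alerts):
    """Share of alerts that pointed at a real problem."""
    return sum(1 for a in alerts if a["true_positive"]) / len(alerts)


def mttd_minutes(incidents):
    """Mean minutes from incident start to the first alert firing."""
    return mean(
        (i["alert_fired"] - i["started"]).total_seconds() / 60 for i in incidents
    )


# Hypothetical quarter: 10 alerts, 8 of them real; 2 detected incidents.
alerts = [{"true_positive": True}] * 8 + [{"true_positive": False}] * 2
incidents = [
    {"started": datetime(2025, 4, 3, 14, 0), "alert_fired": datetime(2025, 4, 3, 14, 3)},
    {"started": datetime(2025, 5, 19, 2, 10), "alert_fired": datetime(2025, 5, 19, 2, 15)},
]

print(f"precision: {alert_precision(alerts):.0%}")
print(f"MTTD: {mttd_minutes(incidents):.1f} min")
```

Incidents that never produced an alert have no detection time and cannot enter the MTTD average; counting them separately keeps the silent-stack failure mode visible.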
What we have learned from operating this for clients
- The first month is the hardest. Out-of-the-box alerts are wrong for your specific environment. Budget 4-6 weeks of tuning to get to the steady state.
- SLOs require buy-in beyond engineering. Product + finance + customer success need to agree on "99.5% is the target", or the SLO conversation collapses on the first contentious incident.
- Runbooks compound in value. The first runbook saves one engineer one hour. The 50th runbook saves three engineers fifteen hours. The library is the most undervalued operational asset.
- Post-mortem discipline produces the biggest gain. Without follow-through, alerts deteriorate over 12-18 months. With follow-through, the stack improves quarter-over-quarter.
- Vendor-default rules are a starting point, not a destination. Every commercial monitoring tool ships with a default rule library. Every default rule library has roughly 30-40% of rules that are wrong for your specific environment. Audit + tune; don't accept defaults.
Real impact from a recent engagement
A 320-engineer SaaS company with a noisy Datadog environment. Before our engagement:
- 18 pages per engineer per week
- Engineers rotating off on-call due to burnout
- Two real customer incidents in the previous 6 months that did not page
- ~$4k/month over-spend on metrics that nobody used
After 4 months of SLO-based alert tuning + runbook discipline + post-mortem follow-through:
- 2.7 pages per engineer per week (85% reduction)
- Alert precision (share of alerts that were true positives): 76% → 91%
- MTTD on customer-impacting events: 12 min → 2 min
- Engineering retention on the on-call rotation improved measurably (anecdotal but consistent)
- Datadog spend down 28% from metric-noise pruning
The one paragraph version
Most monitoring stacks fail in two directions simultaneously: too noisy + too silent on real incidents. The fix is symptom-based alerting on Service Level Objectives, not cause-based alerting on infrastructure thresholds. Every alert has a level (page / ticket / notice / log), a runbook, an owner, and a review date. The Alertmanager inhibit_rule pattern keeps cascading alerts from waking engineers six times. SLO burn-rate alerts catch both fast-and-acute and slow-and-chronic budget burns. The post-mortem-to-runbook pipeline turns incidents into permanent improvements. Measured impact across deployments: 5-8x reduction in page frequency, 15-20 point lift in alert precision, MTTD dropping from 10+ minutes to under 3.
If you want a scoped engagement to tune your monitoring stack, that is the engagement shape under our Intelligent Workflow Automation service. The complementary deployment guidance for the underlying stack is in our €200/month SMB monitoring article.