Modern Workspace · Practitioner · 08 April 2026 · 11 min read

Monitoring That Doesn't Lie: Prometheus, Grafana and the Art of Useful Alerts

Making observability produce signal instead of noise. SLO-based alerting with error-budget burn rates, the four-level alert hierarchy (page / ticket / notice / log), Alertmanager inhibit rules, dashboard discipline, and the post-mortem-to-runbook pipeline that compounds operational quality.

The dashboard is full of red. The on-call engineer's pager has been firing for an hour. Everyone is in a war room. Nothing is actually broken: CPU briefly hit 87% on an instance that recovered before anyone looked. This is the most expensive failure mode in modern observability: a monitoring system that lies.

This article is the practitioner's guide to building observability that produces signal, not noise. It takes a different angle from our €200/mo SMB monitoring stack article: that one is "what to deploy"; this one is "how to make it useful". It is aimed at the IT lead who has Prometheus + Grafana running but is drowning in alerts nobody acts on.

The two failure modes of bad monitoring

Failure mode 01: the noisy stack (pages at 3am for nothing)

Symptoms:

5-15 alerts per day per on-call engineer
Engineers triage the alert, find nothing wrong, close it, move on
Over time, engineers stop reading alert details — they auto-close based on the alert name
The real incident hides in the noise; the engineer scrolls past it

This is the more common failure. Most monitoring stacks start here and stay here until someone forces a tuning sprint.

Failure mode 02: the silent stack (real outages pass undetected)

Symptoms:

The website was down for 45 minutes; customers reported it via Twitter; the team had no alert
The database was at 99% disk for 2 days; nobody noticed until it filled
Backups failed for 3 weeks; only discovered during a quarterly drill
An auto-scaler was misconfigured; the team paid 4x the normal cloud bill until the invoice landed

This failure is less common but more damaging per incident.

The four-level alert hierarchy that works

The Google SRE book popularized "symptom-based alerting" as the alternative to cause-based alerting. The principle: alert on what the user experiences, not on the underlying causes that may or may not affect the user.

Level	Definition	Routing	Acceptable rate
L1 Page	Customer-impacting incident requiring immediate human action	PagerDuty / phone	< 3 / week / engineer
L2 Ticket	Action needed within hours but not minutes	Slack + ticket queue	< 5 / day
L3 Notice	Trend worth knowing about, no immediate action	Slack channel only	Unbounded
L4 Logged	Recorded for post-hoc analysis	Log / dashboard only	Unbounded

Every alert in the stack has a level. The level determines the routing. Engineers know what to do at each level. The signal-to-noise ratio is explicitly designed, not accidentally produced.
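The hierarchy above can be written down as data, so that routing and rate ceilings are checked mechanically rather than remembered. The following is a minimal Python sketch, not part of Prometheus or Alertmanager: the names `Level`, `LEVELS` and `over_budget` are hypothetical, and the table's "< 5 / day" is expressed as 35 per week as a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Level:
    name: str
    routing: str
    max_per_week: Optional[float]  # None means unbounded


# Values copied from the table above. Page counts are per engineer;
# "< 5 / day" for tickets is approximated as 35 per week (assumption).
LEVELS = {
    "page": Level("L1 Page", "PagerDuty / phone", 3),
    "ticket": Level("L2 Ticket", "Slack + ticket queue", 5 * 7),
    "notice": Level("L3 Notice", "Slack channel only", None),
    "log": Level("L4 Logged", "Log / dashboard only", None),
}


def over_budget(weekly_counts: dict) -> list:
    """Return the severities whose weekly count reaches or exceeds the ceiling."""
    flagged = []
    for severity, count in weekly_counts.items():
        level = LEVELS[severity]
        # The table states strict upper bounds ("< 3"), so hitting the bound counts.
        if level.max_per_week is not None and count >= level.max_per_week:
            flagged.append(severity)
    return flagged


if __name__ == "__main__":
    # Hypothetical week: pages are over the ceiling, tickets and notices are not.
    print(over_budget({"page": 7, "ticket": 12, "notice": 140}))
```

A check like this could run against a weekly export of fired alerts; any severity it flags is a candidate for the tuning work described later in the article.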

Symptom-based alerting in practice

The wrong way (cause-based)

yaml

# BAD: alerts on causes that may or may not produce impact
- alert: HighCPU
  expr: cpu_usage > 80
- alert: HighMemory
  expr: memory_usage > 85
- alert: DiskWarn
  expr: disk_used > 70
- alert: NetworkLatency
  expr: latency_p95 > 100ms

These produce noise. A brief spike to 81% CPU is fine. Memory at 86% with no swap activity is fine. Disk at 71% with slow, steady growth is fine. P95 latency at 102ms on a non-customer-facing endpoint is fine.

The right way (symptom-based)

yaml

# GOOD: alerts on what users actually experience
- alert: CustomerErrorRateElevated
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
    / sum(rate(http_requests_total[5m])) by (service) > 0.01
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.service }} 5xx error rate above 1%"

- alert: CustomerLatencyP95Degraded
  expr: |
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
    > 2 * histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[7d])) by (service, le))
  for: 10m
  labels:
    severity: page

These page for real impact: customers seeing errors or experiencing degraded latency. The underlying cause (CPU, memory, disk, network) is investigated after the alert fires; the alert itself is grounded in user experience.

Service Level Objectives (the real signal)

Symptom-based alerting works best when paired with explicit SLOs. The pattern:

Define the SLO. "99.5% of API requests succeed within 2 seconds." "99.9% of database transactions complete within 100ms."
Calculate the error budget. 0.5% of requests can fail before the SLO is missed.
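Alert on burn rate. If errors are consuming the budget at 14.4x the sustainable rate (2% of the monthly budget in one hour, full exhaustion in about 2 days), page. If they are consuming it at 6x (5% of the budget in six hours, full exhaustion in about 5 days), ticket.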
Display burn rate. Dashboards show "we have 73% of this month's budget remaining" — not just "things are red right now".
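The steps above reduce to a few lines of arithmetic. The sketch below works through them in Python, assuming that "monthly" means a rolling 30-day window; the function names are illustrative and not part of any library.

```python
WINDOW_HOURS = 30 * 24  # assumption: "monthly" means a rolling 30-day window


def error_budget_minutes(slo: float) -> float:
    """Minutes of total failure the SLO tolerates over the window."""
    return (1 - slo) * WINDOW_HOURS * 60


def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return WINDOW_HOURS / burn_rate


def budget_consumed(burn_rate: float, hours: float) -> float:
    """Fraction of the window's budget consumed after `hours` at this burn rate."""
    return burn_rate * hours / WINDOW_HOURS


def alert_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio that corresponds to a given burn rate for this SLO."""
    return burn_rate * (1 - slo)


if __name__ == "__main__":
    slo = 0.995
    print(f"budget: {error_budget_minutes(slo):.0f} minutes per 30 days")
    # (burn rate, alert window in hours)
    for rate, window in [(14.4, 1), (6, 6)]:
        print(
            f"{rate}x: error ratio > {alert_threshold(slo, rate):.3f}, "
            f"{budget_consumed(rate, window):.0%} of budget per {window}h, "
            f"exhausted in {hours_to_exhaustion(rate):.0f}h"
        )
```

For a 99.5% SLO this gives a budget of 216 minutes per 30 days. A burn rate of 1x spends the budget exactly over the window, so 14.4x exhausts it in 50 hours and 6x in 120 hours (5 days).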

The SLO alert rule

yaml

- alert: APIBurningErrorBudgetFast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
      / sum(rate(http_requests_total[1h])) by (service)
    ) > (14.4 * 0.005)  # 14.4x burn rate: 2% of the 30-day budget per hour, exhausted in ~50h
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.service }} burning error budget fast"
    description: "At the current burn rate, the monthly error budget is exhausted in about 2 days"

- alert: APIBurningErrorBudgetSlow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
      / sum(rate(http_requests_total[6h])) by (service)
    ) > (6 * 0.005)  # 6x burn rate: 5% of the 30-day budget per 6h, exhausted in ~5 days
  for: 15m
  labels:
    severity: ticket
  annotations:
    summary: "{{ $labels.service }} on track to exhaust error budget in about 5 days"

The math: a 99.5% SLO leaves a budget of 0.5% errors, which over a 30-day month equals 216 minutes (3.6 hours) of total error time. A sustained 14.4x burn rate consumes 2% of that budget per hour and exhausts it in 50 hours. A sustained 6x burn rate consumes 5% of the budget in six hours and exhausts it in 5 days. The two-tier approach catches both fast-and-acute and slow-and-chronic problems.

The five SLO categories that cover most workloads

Category	Example metric	Typical SLO
Availability	HTTP success rate	99.5% - 99.95%
Latency	P95 response time	< 500ms for user-facing; < 2s for batch
Quality	Successful task completion rate	99.0% - 99.9%
Throughput	Requests handled per second under load	> specified peak capacity
Freshness	Maximum data age in derived datasets	< 15min for analytics; < 1min for transactional

Start with availability + latency for every customer-facing service. Add quality + freshness where the data pipeline matters. Throughput SLOs are advanced; tackle after the others are working.

The "useful alert" checklist

Every alert that fires should pass these tests:

Actionable. The engineer knows what to do next when this fires.
Documented runbook. The alert links to a runbook page describing the response.
Recent. The alert was last reviewed within 90 days.
Owned. A named team is responsible for this alert's lifecycle.
Tested. The alert has fired in production within the last quarter, demonstrating it works.
Right level. If the alert routes to PagerDuty, the response genuinely requires immediate human action.

The quarterly review: walk every alert. Any alert that fails 2+ of these checks gets retired or downgraded. The discipline keeps the alert library honest.
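The quarterly walk-through can be partly automated if each alert carries a small metadata record. The sketch below is one hypothetical way to encode the six checks in Python. The `AlertRecord` fields are assumptions about what such a record might hold, and "last quarter" is approximated as 90 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

REVIEW_WINDOW = timedelta(days=90)  # "reviewed within 90 days"
FIRED_WINDOW = timedelta(days=90)   # assumption: "last quarter" taken as 90 days


@dataclass
class AlertRecord:
    name: str
    actionable: bool
    runbook_url: Optional[str]
    last_reviewed: date
    owner: Optional[str]
    last_fired: Optional[date]
    pages: bool
    needs_immediate_action: bool


def failed_checks(alert: AlertRecord, today: date) -> List[str]:
    """Return the names of the checklist items this alert fails."""
    failures = []
    if not alert.actionable:
        failures.append("actionable")
    if not alert.runbook_url:
        failures.append("documented runbook")
    if today - alert.last_reviewed > REVIEW_WINDOW:
        failures.append("recent")
    if not alert.owner:
        failures.append("owned")
    if alert.last_fired is None or today - alert.last_fired > FIRED_WINDOW:
        failures.append("tested")
    if alert.pages and not alert.needs_immediate_action:
        failures.append("right level")
    return failures


def retire_or_downgrade(alerts: List[AlertRecord], today: date) -> List[str]:
    """Names of alerts failing two or more checks, per the quarterly review rule."""
    return [a.name for a in alerts if len(failed_checks(a, today)) >= 2]
```

The output is a worklist, not a verdict: a person still decides whether a flagged alert is retired, downgraded, or repaired.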

The on-call experience that works

Symptoms of a healthy on-call rotation:

Fewer than 3 paging events per engineer per week
Fewer than 1 paging event per engineer per night
Every page produces a written follow-up (post-mortem for incidents, runbook update for noise)
Engineers volunteer for additional shifts to cover colleagues' leave
Engineers rarely rotate off on-call because of burnout

Symptoms of a broken on-call rotation:

10+ pages per engineer per week
Multiple wake-up pages per night during shift
Engineers go to lengths to avoid being on-call
"On-call recovery" days off after each shift
Engineers leave the team citing on-call quality

The transition from broken to healthy is a 6-9 month project. The investment is the alert-tuning work described in this article + the operational discipline of post-mortem follow-up.

The post-mortem-to-runbook pipeline

Every incident produces a post-mortem. Every post-mortem produces at least one of:

A new alert (so the next instance is caught earlier)
A runbook update (so the next responder has better context)
An alert tuning change (so the original alert is more useful)
A code change (to remove the underlying cause)

The pipeline is the operational discipline that turns incidents into permanent improvements. Without it, the team has the same outage twice in 18 months.

The dashboard discipline (the under-discussed layer)

Bad dashboards are as harmful as bad alerts. Three patterns to avoid:

The "wall of green" dashboard

50 panels, all green, no signal. The engineer cannot tell which panels matter. The "everything is fine" indicator becomes "everything is the same colour".

Fix: ruthless panel reduction. The top-level service dashboard has 4-6 panels max, each representing a single SLO. Detail dashboards drill down from there.

The "log-scale everything" dashboard

Y-axes default to log scale. Subtle changes become invisible. A 2x increase in errors looks like a 1.5x change.

Fix: linear scales by default. Log scale only where the dynamic range truly demands it (network throughput across burst vs idle).

The "missing context" dashboard

Panels showing current value with no historical reference. "P95 latency is 340ms" — is that good or bad? Without the historical baseline, the panel is useless.

Fix: every panel shows current + 7-day rolling baseline + threshold annotations.

The configuration that produces useful alerts

Concrete Prometheus rules we ship to clients:

yaml

groups:
  - name: slo-burn-rate
    rules:
      - alert: APIErrorBudgetFastBurn
        expr: |
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
                 / sum(rate(http_requests_total[1h])) by (service))
            > (14.4 * (1 - 0.995))
          )
          AND
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
                 / sum(rate(http_requests_total[5m])) by (service))
            > (14.4 * (1 - 0.995))
          )
        for: 2m
        labels:
          severity: page

  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Host {{ $labels.instance }} unreachable for 5 minutes"

      - alert: DiskWillFillSoon
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "Disk on {{ $labels.instance }} will fill in < 24h based on trend"

      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14*24*3600
        for: 1h
        labels:
          severity: ticket

  - name: synthetic
    rules:
      - alert: ProductionEndpointFailing
        expr: probe_success{job="blackbox-http", target="production"} == 0
        for: 3m
        labels:
          severity: page

The rules above produce roughly 3-8 pages per week per service in a healthy state. The level distribution is the design: pages for real customer impact, tickets for trend warnings, notices for awareness.

The Alertmanager routing that survives 02:00

yaml

route:
  receiver: 'default'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Page only goes to PagerDuty + on-call rotation
    - matchers:
        - severity = page
      receiver: 'pagerduty'
      repeat_interval: 1h

    # Ticket goes to Slack + ticket queue
    - matchers:
        - severity = ticket
      receiver: 'slack-tickets'
      repeat_interval: 12h
      group_interval: 30m

    # Notice goes only to Slack
    - matchers:
        - severity = notice
      receiver: 'slack-notices'
      repeat_interval: 24h
      group_interval: 4h

inhibit_rules:
  # If host is down, do not also alert on the host's specific services
  - source_matchers:
      - alertname = HostDown
    target_matchers:
      - severity =~ "page|ticket"
    equal: [instance]

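The inhibit rule is the under-used feature. When the host is down, every service on that host alerts. The inhibit rule keeps the upstream alert (host down) and suppresses the downstream alerts that carry the same instance label. Alerts aggregated by service, such as the burn-rate rules, have no instance label and are not suppressed by this rule. The on-call engineer wakes up to one page, not fifteen.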
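To make the matching logic concrete, here is a simplified Python model of how one inhibit rule decides which alerts to mute. It is a sketch of the documented semantics, not Alertmanager's implementation: matchers are reduced to anchored regexes on label values, a missing label is treated as an empty string, and the alert names other than `HostDown` are hypothetical.

```python
import re

# Mirrors the inhibit rule above: source, target and the labels that must be equal.
RULE = {
    "source": {"alertname": r"HostDown"},
    "target": {"severity": r"page|ticket"},
    "equal": ["instance"],
}


def matches(labels: dict, matchers: dict) -> bool:
    """True if every matcher fully matches the corresponding label value."""
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in matchers.items()
    )


def is_inhibited(alert: dict, firing: list, rule: dict = RULE) -> bool:
    """True if some firing source alert mutes this alert under the rule."""
    if not matches(alert, rule["target"]):
        return False
    alert_is_source = matches(alert, rule["source"])
    for other in firing:
        if other is alert or not matches(other, rule["source"]):
            continue
        # An alert matching both sides is not inhibited by alerts that also
        # match both sides; this prevents HostDown from muting HostDown.
        if alert_is_source and matches(other, rule["target"]):
            continue
        if all(other.get(l, "") == alert.get(l, "") for l in rule["equal"]):
            return True
    return False


if __name__ == "__main__":
    firing = [
        {"alertname": "HostDown", "severity": "page", "instance": "web-1"},
        {"alertname": "NginxDown", "severity": "page", "instance": "web-1"},
        {"alertname": "NginxDown", "severity": "page", "instance": "web-2"},
        {"alertname": "APIErrorBudgetFastBurn", "severity": "page", "service": "api"},
    ]
    for a in firing:
        state = "muted" if is_inhibited(a, firing) else "delivered"
        print(a["alertname"], a.get("instance", "-"), state)
```

In this model only the `NginxDown` alert on `web-1` is muted. The alert on `web-2` and the service-level burn-rate alert are still delivered because their `instance` value differs from the source alert's.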

The metric-noise patterns to drop

Common noise sources in default metric collections:

Per-CPU core metrics: aggregate to per-host. The detail rarely matters; the volume is significant.
Per-interface kernel counters: drop the loopback + docker0 + virtual interface metrics that produce volume with no signal.
Process-level metrics: keep for critical processes; drop the long tail.
Per-temp-table queries: in databases, exclude the noise of session-scoped temporary tables.
Internal health-check requests: tag and exclude from RED metrics; otherwise they dilute real traffic measurements.

Aggressive metric-noise pruning reduces Prometheus storage cost + improves query performance + makes dashboards cleaner. The discipline is to drop metrics, not just add them.

The annual measurement we track

Two metrics for the monitoring stack itself:

Mean time to detect (MTTD): from incident start to alert firing. Healthy: < 5 minutes for customer-impacting events.
Alert precision: (true-positive alerts) / (total alerts). Healthy: > 80%.

Both numbers are reviewed quarterly with the engineering team, and both drive the tuning sprints when they drift.
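Both metrics are simple to compute once incidents and alerts are recorded consistently. The sketch below assumes a hypothetical record format: each incident is a pair of (start time, first alert time or `None` if it never alerted), and alerts have been hand-labelled as true or false positives during triage.

```python
from datetime import datetime
from statistics import mean
from typing import List, Optional, Tuple

Incident = Tuple[datetime, Optional[datetime]]  # (started, first alert or None)


def mttd_minutes(incidents: List[Incident]) -> Optional[float]:
    """Mean minutes from incident start to first alert, over detected incidents."""
    delays = [
        (alerted - started).total_seconds() / 60
        for started, alerted in incidents
        if alerted is not None
    ]
    return mean(delays) if delays else None


def undetected(incidents: List[Incident]) -> int:
    """Incidents that never alerted; MTTD alone hides these, so report them too."""
    return sum(1 for _, alerted in incidents if alerted is None)


def alert_precision(true_positives: int, total_alerts: int) -> float:
    """(true-positive alerts) / (total alerts), as defined above."""
    if total_alerts == 0:
        raise ValueError("no alerts in the period")
    return true_positives / total_alerts
```

Reporting the undetected count next to MTTD matters because an incident that never alerted contributes nothing to the mean, so a silent stack can show a deceptively good MTTD.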

What we have learned from operating this for clients

The first month is the hardest. Out-of-the-box alerts are wrong for your specific environment. Budget 4-6 weeks of tuning to get to the steady state.
SLOs require buy-in beyond engineering. Product + finance + customer success need to agree on "99.5% is the target", or the SLO conversation collapses on the first contentious incident.
Runbooks compound in value. The first runbook saves one engineer one hour. The 50th runbook saves three engineers fifteen hours. The library is the most undervalued operational asset.
Post-mortem discipline produces the biggest gain. Without follow-through, alerts deteriorate over 12-18 months. With follow-through, the stack improves quarter-over-quarter.
Vendor-default rules are a starting point, not a destination. Every commercial monitoring tool ships with a default rule library. Every default rule library has roughly 30-40% of rules that are wrong for your specific environment. Audit + tune; don't accept defaults.

Real impact from a recent engagement

A 320-engineer SaaS company with a noisy Datadog environment. Before our engagement:

18 pages per engineer per week
Engineers rotating off on-call due to burnout
Two real customer incidents in the previous 6 months that did not page
~$4k/month over-spend on metrics that nobody used

After 4 months of SLO-based alert tuning + runbook discipline + post-mortem follow-through:

2.7 pages per engineer per week (85% reduction)
Alert precision: 76% → 91%
MTTD on customer-impacting events: 12 min → 2 min
Engineering retention on the on-call rotation improved measurably (anecdotal but consistent)
Datadog spend down 28% from metric-noise pruning

The one paragraph version

Most monitoring stacks fail in two directions simultaneously: too noisy + too silent on real incidents. The fix is symptom-based alerting on Service Level Objectives, not cause-based alerting on infrastructure thresholds. Every alert has a level (page / ticket / notice / log), a runbook, an owner, and a review date. The Alertmanager inhibit_rules pattern keeps cascading alerts from waking engineers six times. SLO burn-rate alerts catch both fast-and-acute and slow-and-chronic budget burns. The post-mortem-to-runbook pipeline turns incidents into permanent improvements. Measured impact across deployments: 5-8x reduction in page frequency, 15-20 point lift in alert precision, MTTD dropping from 10+ minutes to under 3.

If you want a scoped engagement to tune your monitoring stack, that is the engagement shape under our Intelligent Workflow Automation service. The complementary deployment guidance for the underlying stack is in our €200/month SMB monitoring article.
