Observability for SMBs: Building a €200/mo Monitoring Stack That Actually Alerts
A €200/month Prometheus + Grafana + Loki + Alertmanager stack for the 50-person SMB. Real configs, real dashboards, real alert rules, and the TCO break-even (around 150 hosts) where self-hosted tips back toward managed SaaS.
A 50-person company does not need Datadog. It needs to know when the database is dying, when the disk is filling up, and when the website stopped responding — and it needs to know within minutes, not hours. This article is the €200/month monitoring stack we ship to clients who want serious observability without enterprise pricing.
The stack is Prometheus + Grafana + Loki + Alertmanager, deployed on a small Linux VM with explicit dashboards, real alert rules, and a sustainable operational rhythm. Real numbers, real configs, real costs.
Why not just use Datadog / New Relic / Splunk?
SaaS observability platforms are excellent. They are also priced per-host, per-GB, per-active-user. For a 50-person company with ~30 hosts + Kubernetes + cloud services:
| Tool | Typical monthly cost |
|---|---|
| Datadog (full stack) | €1,800-€4,500 |
| New Relic (full platform) | €1,200-€3,200 |
| Splunk Cloud | €2,500-€6,000 |
| Self-hosted Prometheus stack on Hetzner | €150-€280 |
For 50-person scale, that is roughly €11,000-€70,000/year saved by self-hosting, depending on which platform you compare against. The operational cost is real (0.1-0.2 FTE of engineer time) but smaller than the SaaS premium at this scale. Above roughly 200 employees the calculation tips back toward managed SaaS for its operational simplicity; below that, self-hosted wins.
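The saving can be checked directly from the table. The sketch below takes the monthly figures at face value and brackets each row by pairing the low SaaS price with the high self-hosted price, and the high SaaS price with the low self-hosted price. The function name `annual_saving` is ours, not part of any tool.

```shell
#!/bin/sh
set -eu

# annual_saving SAAS_MONTHLY SELF_HOSTED_MONTHLY -> euros saved per year
annual_saving() {
    echo $(( ($1 - $2) * 12 ))
}

# Self-hosted range from the table: EUR 150-280 per month.
SELF_LOW=150
SELF_HIGH=280

# Worst case: low SaaS price against high self-hosted price.
# Best case: high SaaS price against low self-hosted price.
for row in "Datadog 1800 4500" "NewRelic 1200 3200" "SplunkCloud 2500 6000"; do
    set -- $row
    printf '%-12s EUR %6d - %6d per year\n' "$1" \
        "$(annual_saving "$2" "$SELF_HIGH")" \
        "$(annual_saving "$3" "$SELF_LOW")"
done
```

Across the three rows the bracket runs from about €11,000 (cheapest New Relic figure) to about €70,000 (most expensive Splunk Cloud figure) per year, before counting engineer time.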
The reference architecture
Components:
- Prometheus: the metrics database + scraper. Pulls metrics from exporters every 15-30 seconds. Stores 30-90 days locally.
- Grafana: dashboarding + visualisation. Reads from Prometheus and Loki.
- Loki: centralised log aggregation. Promtail or Vector agents ship logs from hosts.
- Alertmanager: alert routing + deduplication + silencing. Sends to Slack / Teams / PagerDuty.
- Node exporters / Application exporters: per-host + per-service metrics endpoints.
- Blackbox exporter: external probes (HTTP / TCP / DNS / ICMP) for synthetic monitoring.
Deployed on a single Hetzner CCX13 VM (€20/month, 2 vCPU, 8 GB RAM, 80 GB NVMe) for the first 6-12 months. Scales to a 3-node Prometheus cluster (€60/month) when retention or query load demands it.
The €200/month bill of materials
| Line | Monthly cost |
|---|---|
| Hetzner CCX13 VM (monitoring host) | €20 |
| Hetzner Volume (200 GB SSD for time-series storage) | €10 |
| Hetzner Object Storage (long-term log retention) | €8 |
| Backup target (Hetzner Storage Box, 1 TB) | €11 |
| Grafana Cloud (free tier — used for OnCall + Sandbox) | €0 |
| UptimeRobot Premium (external monitoring of the monitoring stack) | €8 |
| OpsGenie Standard (or Grafana OnCall free) | €10-€80 |
| Operational engineering (0.1 FTE) | ~€100-€150 in equivalent time |
| All-in monthly cost | €167-€287 |
The "operational engineering" line is the largest single item. The cash cost of the infrastructure and tooling is €67-€137/month, depending on the paging tool. The rest is engineer time, which is the cost shape every self-hosted decision carries.
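As a quick check on the table, the sketch below adds up the lines in shell arithmetic, separating cash spend from the all-in figure. The only inputs are the table's own numbers; the helper name `total` is hypothetical.

```shell
#!/bin/sh
set -eu

# Fixed monthly cash lines from the table (EUR): VM, volume, object storage,
# backup box, Grafana Cloud free tier, UptimeRobot.
FIXED=$(( 20 + 10 + 8 + 11 + 0 + 8 ))

# total PAGING_TOOL ENGINEERING_TIME -> monthly total in EUR
total() {
    echo $(( FIXED + $1 + $2 ))
}

echo "Cash only: EUR $(total 10 0) - $(total 80 0)"
echo "All-in:    EUR $(total 10 100) - $(total 80 150)"
```

The spread in the cash figure comes entirely from the paging tool line; everything else is fixed.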
The Prometheus configuration that works
The minimum-viable prometheus.yml for a 30-host SMB:
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    environment: prod
    organisation: client-name

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yaml"

scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/hosts.json']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.client-name.com
          - https://app.client-name.com
          - https://api.client-name.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
# Retention is set with server flags rather than in this file:
#   --storage.tsdb.retention.time=60d
#   --storage.tsdb.retention.size=150GB
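The `node` job reads its targets from `hosts.json` through file-based service discovery, which expects a JSON list of objects with a `targets` array and optional `labels`. The sketch below writes such a file. The hostnames and the `role` label are placeholders, and the script writes to a local `./targets` directory unless `TARGET_DIR` is set, so it can be tried without root.

```shell
#!/bin/sh
set -eu

# On the monitoring host this would be /etc/prometheus/targets, the path
# referenced in prometheus.yml. Defaults to a local directory for a dry run.
TARGET_DIR="${TARGET_DIR:-./targets}"
mkdir -p "$TARGET_DIR"

# Hostnames and the "role" label are hypothetical. 9100 is the default
# node_exporter port.
cat > "$TARGET_DIR/hosts.json" <<'EOF'
[
  {
    "targets": ["web-01.internal:9100", "web-02.internal:9100"],
    "labels": { "role": "web" }
  },
  {
    "targets": ["db-01.internal:9100"],
    "labels": { "role": "database" }
  }
]
EOF

echo "Wrote $TARGET_DIR/hosts.json"
```

Prometheus picks up changes to file-discovery files without a restart, so adding a host is a one-line edit. Running `promtool check config prometheus.yml` before reloading catches syntax mistakes in the main config.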
The Grafana dashboards we ship
The dashboard library — start with these, add as needs surface:
- Infrastructure Overview. CPU / memory / disk / network across all hosts. Single screen. Red / amber / green per host.
- Application Performance. Per-service request rate, error rate, P95 latency (RED metrics). Sorted by error rate descending.
- Database Health. Connection pool usage, slow query count, replication lag, disk space.
- Business KPIs. Signups today, active users, orders processed. Pulled from the application via custom metrics endpoint.
- Incident Investigation. Aggregates logs + metrics for a defined service over a time window. Used during P1 / P2 response.
- SLO Tracker. Error budget burn rate per service. Forecasts when budget will exhaust.
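For the SLO Tracker, burn rate is commonly defined as the observed error ratio divided by the error budget (1 minus the SLO target). A burn rate of 1 uses up the budget exactly over the SLO window. The sketch below works one example; the 99.9% target, the 0.5% error ratio, and the 30-day window are illustrative values, and the function names are ours.

```shell
#!/bin/sh
set -eu

# burn_rate ERROR_RATIO SLO_TARGET
# A result of 1.0 means the budget lasts exactly the SLO window.
burn_rate() {
    awk -v err="$1" -v slo="$2" 'BEGIN { printf "%.1f\n", err / (1 - slo) }'
}

# days_to_exhaustion WINDOW_DAYS BURN_RATE
# Assumes a full budget at the start and a constant burn rate.
days_to_exhaustion() {
    awk -v win="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", win / rate }'
}

rate=$(burn_rate 0.005 0.999)
echo "burn rate: $rate"
echo "budget exhausted in: $(days_to_exhaustion 30 "$rate") days"
```

In Grafana the same ratio would normally be computed in PromQL from the request counters; the shell version only shows the arithmetic behind the panel.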
Every dashboard is JSON-exported and version-controlled in Git. Changes go through PR review. Drift between Git and live is reconciled monthly.
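The monthly reconciliation can be as simple as a recursive diff between the Git checkout and a fresh export. The sketch below assumes the live dashboards have already been exported to a directory (for example through Grafana's HTTP API); that export step, the directory layout, and the function name `dashboard_drift` are assumptions, not part of the original setup.

```shell
#!/bin/sh
set -eu

# dashboard_drift REPO_DIR LIVE_DIR
# Prints one line per file that differs or exists on one side only.
# Exit status follows diff: 0 when the trees match, non-zero otherwise.
dashboard_drift() {
    diff -rq "$1" "$2"
}

# Demo with two throwaway directories standing in for the Git checkout and
# a fresh export from Grafana.
repo=$(mktemp -d)
live=$(mktemp -d)
echo '{"title":"Infrastructure Overview","panels":[1,2]}' > "$repo/infra.json"
echo '{"title":"Infrastructure Overview","panels":[1,2,3]}' > "$live/infra.json"

if dashboard_drift "$repo" "$live"; then
    echo "No drift"
else
    echo "Drift found: open a PR or revert the live change"
fi
rm -rf "$repo" "$live"
```

Exported dashboard JSON carries fields such as `version` that change on every save, so a real comparison would likely strip those first to avoid false positives.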
The alerting rules that matter
The alert library we ship — these cover ~85% of the genuine paging-worthy incidents in a typical SMB:
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          runbook: "https://docs.internal/runbooks/host-down"

      - alert: HighCPU
        # rate() rather than irate(): brief spikes in irate() can reset the "for" timer
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }} ({{ $value | humanize }}%)"

      - alert: DiskFillingUp
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} will fill in <24h"

      - alert: DiskCritical
        # Same filesystem filter as DiskFillingUp, so pseudo-filesystems do not page
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"}) * 100 < 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk on {{ $labels.instance }} at <5% free"

      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }} ({{ $value | humanize }}%)"

  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate >5% on {{ $labels.service }}"

      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m]))
            by (service, le)) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency >2s on {{ $labels.service }}"

  - name: synthetic
    rules:
      - alert: WebsiteDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Public endpoint {{ $labels.instance }} not responding"
The alerts are scoped and named, and each should link to a runbook, as HostDown does. Every alert that fires either pages an engineer (critical) or generates a ticket (warning). No alert ships without a documented response.
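The DiskFillingUp rule relies on `predict_linear`, which fits a least-squares line through the samples in the range and extrapolates it forward. The sketch below reproduces that arithmetic with awk on three made-up samples. It is an illustration only: it extrapolates from the last sample rather than from the rule's evaluation time, and the function name `predict_free_bytes` is ours.

```shell
#!/bin/sh
set -eu

# predict_free_bytes SECONDS_AHEAD
# Reads "timestamp_seconds free_bytes" pairs on stdin, fits a least-squares
# line, and prints the value predicted SECONDS_AHEAD after the last sample.
# Timestamps are shifted to start at zero to keep the sums well-conditioned.
predict_free_bytes() {
    awk -v ahead="$1" '
        NR == 1 { t0 = $1 }
        { t = $1 - t0; n++; st += t; sv += $2; stv += t * $2; stt += t * t; last = t }
        END {
            slope = (n * stv - st * sv) / (n * stt - st * st)
            intercept = (sv - slope * st) / n
            printf "%.0f\n", intercept + slope * (last + ahead)
        }'
}

# Three hourly samples losing 10 units per hour (units are arbitrary here).
prediction=$(printf '%s\n' "0 100" "3600 90" "7200 80" | predict_free_bytes 86400)
echo "predicted free space in 24h: $prediction"

if [ "$prediction" -lt 0 ]; then
    echo "DiskFillingUp would be pending"
fi
```

A negative prediction means the line crosses zero within the horizon, which is exactly the `< 0` condition in the rule. The `for: 30m` clause then requires the condition to hold continuously before the alert fires.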
The Alertmanager routing
route:
  receiver: 'team-default'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - matchers:
        - severity = warning
      receiver: 'slack-warnings'
      repeat_interval: 12h

receivers:
  # Fallback for alerts that match no child route. The root route references it,
  # so it must be defined. Sending it to the same Slack channel is an assumption.
  - name: 'team-default'
    slack_configs:
      - api_url: $SLACK_WEBHOOK
        channel: '#alerts'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: $PAGERDUTY_SERVICE_KEY
  - name: 'slack-warnings'
    slack_configs:
      - api_url: $SLACK_WEBHOOK
        channel: '#alerts'
        send_resolved: true
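Alertmanager does not expand environment variables in its configuration file, so the `$SLACK_WEBHOOK` and `$PAGERDUTY_SERVICE_KEY` placeholders have to be substituted before the file is loaded. The sketch below shows one way to do that with sed. It is an illustration under assumptions: the function names are ours, the demo values are fake, and this is not necessarily how the starter repository renders its templates.

```shell
#!/bin/sh
set -eu

# Escape characters that are special in a sed replacement: &, \ and the
# | delimiter used below.
sed_escape() {
    printf '%s' "$1" | sed 's/[&|\\]/\\&/g'
}

# render_template FILE
# Prints FILE with the two placeholders replaced by the values of the
# environment variables of the same name.
render_template() {
    slack=$(sed_escape "$SLACK_WEBHOOK")
    pagerduty=$(sed_escape "$PAGERDUTY_SERVICE_KEY")
    sed -e 's|[$]SLACK_WEBHOOK|'"$slack"'|g' \
        -e 's|[$]PAGERDUTY_SERVICE_KEY|'"$pagerduty"'|g' "$1"
}

# Demo with placeholder values; real values would come from .env.
SLACK_WEBHOOK='https://hooks.example.invalid/services/T000/B000?a=1&b=2'
PAGERDUTY_SERVICE_KEY='example-key'
export SLACK_WEBHOOK PAGERDUTY_SERVICE_KEY

template=$(mktemp)
printf '%s\n' 'api_url: $SLACK_WEBHOOK' 'service_key: $PAGERDUTY_SERVICE_KEY' > "$template"
render_template "$template"
rm -f "$template"
```

Recent Alertmanager releases also accept file-based secrets, such as `api_url_file` for Slack, which avoids writing the webhook into the rendered config at all.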
Loki for logs
The logs piece. Promtail or Vector agents ship logs to Loki, which indexes them by labels, ideally the same labels the Prometheus metrics carry. Grafana queries both, and Loki's LogQL is modelled on PromQL, so the syntax carries over.
The trap: shipping every log line at full verbosity. A small estate generates 5-50 GB of logs per day. Loki handles this on the modest VM if configured for retention; without retention policies the disk fills inside a month.
The configuration that works:
- Hot tier (last 14 days): stored on the monitoring VM's NVMe. Queryable instantly.
- Cold tier (14-365 days): stored on Hetzner Object Storage. Queries take a few seconds longer.
- Drop list: kernel-message noise, repeating health-check 200s, low-value access-log lines.
- Sample list: high-volume but useful logs (web access logs) sampled at 10%.
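A rough sizing check shows why the drop and sample lists matter. The sketch below multiplies daily raw log volume by the hot-tier window, with an optional percentage for the share of lines kept. The 40% figure is purely illustrative and the function name is ours; the result is raw volume, before whatever compression Loki achieves on the actual logs.

```shell
#!/bin/sh
set -eu

# raw_hot_tier_gb DAILY_GB HOT_DAYS [KEEP_PERCENT]
# Raw (uncompressed) volume held in the hot tier. KEEP_PERCENT is the share
# of lines that survive the drop and sample lists (default 100).
raw_hot_tier_gb() {
    echo $(( $1 * $2 * ${3:-100} / 100 ))
}

echo "5 GB/day, 14 days, keep all:  $(raw_hot_tier_gb 5 14) GB"
echo "50 GB/day, 14 days, keep all: $(raw_hot_tier_gb 50 14) GB"
echo "50 GB/day, 14 days, keep 40%: $(raw_hot_tier_gb 50 14 40) GB"
```

At the top of the 5-50 GB/day range the raw hot tier is several times larger than the monitoring VM's disk, so trimming at the agent is what keeps the small-VM design workable.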
The synthetic monitoring piece (do not skip)
Something outside the monitoring stack has to monitor the monitoring stack. If Prometheus itself dies, the alerts you configured do not fire. The defence:
- External probe (UptimeRobot, Hetzner Statuspage probe) verifies the monitoring host is reachable
- External probe verifies the Alertmanager webhook receives test alerts
- Weekly synthetic test: fire a known test-alert; verify it lands in Slack within 60 seconds
Without these defences, you cannot trust that the alerting pipeline is healthy. With them, the meta-monitoring catches silent failures.
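The weekly synthetic test can be a cron job that posts a hand-made alert to Alertmanager's v2 API. The sketch below builds the payload and defines, but does not call, the function that sends it. The alert name, the `service` label, and the `ALERTMANAGER_URL` default are assumptions; `severity: warning` is chosen so the alert follows the Slack route. Confirming that the message reaches Slack within 60 seconds is left to a separate check.

```shell
#!/bin/sh
set -eu

ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://localhost:9093}"

# build_test_alert TIMESTAMP
# Prints the JSON body for one synthetic alert. The alert name and the
# service label are placeholders.
build_test_alert() {
    printf '[{"labels":{"alertname":"SyntheticTestAlert","severity":"warning","service":"meta-monitoring"},'
    printf '"annotations":{"summary":"Weekly pipeline test fired at %s"}}]\n' "$1"
}

# Posts the alert to Alertmanager. Not called here; run it from cron.
fire_test_alert() {
    build_test_alert "$(date -u +%Y-%m-%dT%H:%M:%SZ)" |
        curl -fsS -X POST -H 'Content-Type: application/json' \
            --data-binary @- "$ALERTMANAGER_URL/api/v2/alerts"
}

# Show the payload without sending anything.
build_test_alert "2024-01-01T00:00:00Z"
```

Because the payload sets no end time, Alertmanager treats the alert as resolved once its resolve timeout passes without a repeat, so the test cleans up after itself.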
The first-day setup script
The minimum-viable bootstrap. Run on a fresh Hetzner CCX13:
#!/bin/bash
# bootstrap-monitoring.sh
set -euo pipefail

# Install Docker
curl -fsSL https://get.docker.com | sh
systemctl enable --now docker

# Pull our pre-configured stack
git clone https://github.com/itsailor/monitoring-starter.git
cd monitoring-starter

# Customise per-client (falls back to vi when EDITOR is unset)
cp .env.example .env
"${EDITOR:-vi}" .env  # set ORG_NAME, SLACK_WEBHOOK, PAGERDUTY_KEY, ALERT_EMAIL

# Pull config templates from our reference repo
./scripts/render-config.sh

# Bring up the stack
docker compose up -d

# Verify health, retrying while the containers finish starting
check() { curl -fsS --retry 10 --retry-delay 3 --retry-connrefused -o /dev/null "$1"; }
check http://localhost:9090/-/healthy && echo "Prometheus OK"
check http://localhost:3000/api/health && echo "Grafana OK"
check http://localhost:9093/-/healthy && echo "Alertmanager OK"

echo "Stack running. Configure Cloudflare Tunnel for public access:"
echo "  cloudflared tunnel create monitoring-client-name"
From clean VM to running stack: ~30 minutes. The Git repo carries all the configs, dashboards, alert rules. Per-client customisation is environment variables + a hosts.json file listing the targets.
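Since per-client customisation lives in `.env`, a small pre-flight check can catch a forgotten value before the stack starts. The sketch below looks for the four variables named in the bootstrap script and fails if any is missing or empty. The function name `check_env_file` and the demo values are hypothetical.

```shell
#!/bin/sh
set -eu

# check_env_file FILE
# Succeeds only if every required key has a non-empty value in FILE.
check_env_file() {
    missing=0
    for key in ORG_NAME SLACK_WEBHOOK PAGERDUTY_KEY ALERT_EMAIL; do
        if ! grep -Eq "^${key}=.+" "$1"; then
            echo "missing or empty: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demo against a throwaway file with one value left blank.
env_file=$(mktemp)
printf '%s\n' 'ORG_NAME=client-name' 'SLACK_WEBHOOK=' \
    'PAGERDUTY_KEY=example' 'ALERT_EMAIL=ops@example.invalid' > "$env_file"

if check_env_file "$env_file"; then
    echo ".env looks complete"
else
    echo ".env is incomplete; fix it before running docker compose up"
fi
rm -f "$env_file"
```

Calling a check like this between editing `.env` and rendering the configs would stop the bootstrap early instead of producing a stack that starts but cannot send alerts.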
What "production-ready" means at this scale
For an SMB observability stack, "production-ready" requires:
- Backup of the monitoring data. Prometheus data on the VM gets backed up nightly to Hetzner Storage Box.
- HA at the data plane. Two Prometheus instances scraping the same targets (rather than a single point of failure). Cheap at this scale.
- Documentation of the runbooks. Every alert links to a runbook. Runbooks are reviewed quarterly.
- External monitoring of the monitoring. As described above.
- Versioned config. Every config + dashboard + alert rule in Git, deployed via CI.
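For the nightly backup, one option is Prometheus's TSDB snapshot endpoint, which is only available when the server is started with `--web.enable-admin-api`. The sketch below parses the snapshot name from the API response and defines, but does not call, a function that copies the snapshot off the host. The data directory, the backup target, and the function names are all placeholders to adapt.

```shell
#!/bin/sh
set -eu

PROM_URL="${PROM_URL:-http://localhost:9090}"
# Assumption: where the TSDB lives as seen by this script.
PROM_DATA_DIR="${PROM_DATA_DIR:-/prometheus}"
# Placeholder rsync destination, for example a Storage Box account.
BACKUP_TARGET="${BACKUP_TARGET:-backup-user@backup-host:prometheus/}"

# Extracts the snapshot name from the admin API's JSON response on stdin.
snapshot_name() {
    sed -n 's/.*"name":"\([^"]*\)".*/\1/p'
}

# Takes a snapshot, copies it away, then removes the local copy.
# Not called here; schedule it nightly.
backup_prometheus() {
    name=$(curl -fsS -X POST "$PROM_URL/api/v1/admin/tsdb/snapshot" | snapshot_name)
    [ -n "$name" ] || { echo "snapshot failed" >&2; return 1; }
    rsync -a "$PROM_DATA_DIR/snapshots/$name/" "$BACKUP_TARGET$name/"
    rm -rf "$PROM_DATA_DIR/snapshots/$name"
}

# Parse a response of the documented shape without contacting Prometheus.
echo '{"status":"success","data":{"name":"20171210T211224Z-2be650b6d019eb54"}}' | snapshot_name
```

Snapshots are written under the `snapshots` directory inside the data directory, so when Prometheus runs in a container the path has to be the one visible from wherever the script runs.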
What we have learned from running this for clients
- The first 30 days is alert tuning. Out-of-the-box rules fire too aggressively or too softly. Tune to your reality. Aim for <3 false-positive pages per week per on-call.
- The dashboard library grows by accretion. Engineers ask for specific views. Build them, document them, share them.
- The cost stays under €300/month even at 100 hosts. The Prometheus + Loki retention tier is the major variable; for most SMBs the all-in stays small.
- The on-call burden is real but modest. Expect 2-5 paging-worthy events per month at this scale. Most are infrastructure (disk filling, host issues) not application incidents.
- Migration to SaaS is easy if you grow past the threshold. Prometheus metrics export to Datadog / Grafana Cloud / Mimir cleanly. The stack is portable.
The one paragraph version
A 50-person SMB does not need Datadog. It needs Prometheus + Grafana + Loki + Alertmanager on a small Linux VM for €170-€290/month, with explicit dashboards, named alert rules, runbook-linked routing, and external meta-monitoring. The stack scales to ~150 hosts before TCO tips toward SaaS. The first 30 days is alert tuning; the operational cost is 0.1 FTE. The stack is portable — if you outgrow it, Prometheus exports cleanly to Datadog or Grafana Cloud. Skip the SaaS premium at this scale; spend the difference on engineering capacity.
If you want this designed + deployed + handed over, that is the engagement shape under our Intelligent Workflow Automation service (for the integration with on-call + ticketing) + Azure Cloud Infrastructure for the broader infrastructure landing-zone. The deeper monitoring practice — useful alerts, error-budget SLOs, debt management — is covered in our monitoring-that-doesn't-lie practitioner deep-dive.