Modern Workspace · Intro · 21 April 2026 · 8 min read

Observability for SMBs: Building a €200/mo Monitoring Stack That Actually Alerts

A €200/month Prometheus + Grafana + Loki + Alertmanager stack for the 50-person SMB. Real configs, real dashboards, real alert rules, and the TCO break-even (around 150 hosts) where self-hosted tips back toward managed SaaS.

#Observability #Monitoring #Grafana

ITSailor

Senior IT Consultant

A 50-person company does not need Datadog. It needs to know when the database is dying, when the disk is filling up, and when the website stopped responding — and it needs to know within minutes, not hours. This article is the €200/month monitoring stack we ship to clients who want serious observability without enterprise pricing.

The stack is Prometheus + Grafana + Loki + Alertmanager, deployed on a small Linux VM with explicit dashboards, real alert rules, and a sustainable operational rhythm. Real numbers, real configs, real costs.

Why not just use Datadog / New Relic / Splunk?

SaaS observability platforms are excellent. They are also priced per-host, per-GB, per-active-user. For a 50-person company with ~30 hosts + Kubernetes + cloud services:

Tool	Typical monthly cost
Datadog (full stack)	€1,800-€4,500
New Relic (full platform)	€1,200-€3,200
Splunk Cloud	€2,500-€6,000
Self-hosted Prometheus stack on Hetzner	€150-€280

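At 50-person scale, the table's ranges put the saving from self-hosting at roughly €11,000-€70,000/year (cheapest SaaS against the priciest self-hosted build at one end, priciest SaaS against the cheapest at the other). The operational cost is real (0.1-0.2 FTE of engineer time) but smaller than the SaaS premium at this scale. Above roughly 200 employees the calculation tips back toward managed SaaS for its operational simplicity; below that, self-hosted wins.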
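
The saving can be bounded directly from the table. A minimal sketch, taking the table's monthly ranges at face value and comparing the extremes (the function name is illustrative):

```python
# Monthly cost ranges in EUR, copied from the table above.
SAAS = {
    "Datadog": (1800, 4500),
    "New Relic": (1200, 3200),
    "Splunk Cloud": (2500, 6000),
}
SELF_HOSTED = (150, 280)


def annual_saving_range(saas, self_hosted):
    """Return (worst case, best case) yearly saving across all SaaS options."""
    cheapest_saas = min(low for low, _ in saas.values())
    priciest_saas = max(high for _, high in saas.values())
    # Worst case: cheapest SaaS against the priciest self-hosted build.
    worst = (cheapest_saas - self_hosted[1]) * 12
    # Best case: priciest SaaS against the cheapest self-hosted build.
    best = (priciest_saas - self_hosted[0]) * 12
    return worst, best


print(annual_saving_range(SAAS, SELF_HOSTED))  # (11040, 70200)
```

The self-hosted range already includes engineer time, so the comparison is like for like only if the SaaS option needs no comparable upkeep, which is an optimistic assumption.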

The reference architecture

Components:

Prometheus: the metrics database + scraper. Pulls metrics from exporters every 15-30 seconds. Stores 30-90 days locally.
Grafana: dashboarding + visualisation. Reads from Prometheus and Loki.
Loki: centralised log aggregation. Promtail or Vector agents ship logs from hosts.
Alertmanager: alert routing + deduplication + silencing. Sends to Slack / Teams / PagerDuty.
Node exporters / Application exporters: per-host + per-service metrics endpoints.
Blackbox exporter: external probes (HTTP / TCP / DNS / ICMP) for synthetic monitoring.
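
An application exporter is, at its simplest, an HTTP endpoint that serves metrics in the Prometheus text format. The sketch below shows that shape using only the standard library. The metric names and values are hypothetical, and a real service would more likely use an official Prometheus client library than hand-render the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(gauges):
    """Render {name: value} as Prometheus text exposition format (gauges only)."""
    lines = []
    for name in sorted(gauges):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {gauges[name]}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical values; a real app would read these from its own state."""
    return {"app_signups_today": 42, "app_orders_processed_today": 17}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; Prometheus would scrape http://host:8000/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; adding the host and port to a scrape job is all Prometheus needs to begin collecting from it.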

Deployed on a single Hetzner CCX13 VM (€20/month, 2 vCPU, 8 GB RAM, 80 GB NVMe) for the first 6-12 months. Scales to a 3-node Prometheus cluster (€60/month) when retention or query load demands it.

The €200/month bill of materials

Line	Monthly cost
Hetzner CCX13 VM (monitoring host)	€20
Hetzner Volume (200 GB SSD for time-series storage)	€10
Hetzner Object Storage (long-term log retention)	€8
Backup target (Hetzner Storage Box, 1 TB)	€11
Grafana Cloud (free tier — used for OnCall + Sandbox)	€0
UptimeRobot Premium (external monitoring of the monitoring stack)	€8
OpsGenie Standard (or Grafana OnCall free)	€10-€80
Operational engineering (0.1 FTE)	~€100-€150 in equivalent time
All-in monthly cost	€167-€287

The "operational engineering" line is the largest single item. The cash cost of the infrastructure and tooling is €67-€137/month, depending mostly on the on-call tool. The rest is engineer time, which is the cost shape every self-hosted decision carries.
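
The totals can be reproduced from the table. A small sketch that sums the line items and also splits out the cash-only subtotal (item names are abbreviated from the table):

```python
# (line item, low EUR/month, high EUR/month), copied from the table above.
BOM = [
    ("Hetzner CCX13 VM", 20, 20),
    ("Hetzner Volume", 10, 10),
    ("Hetzner Object Storage", 8, 8),
    ("Storage Box backup", 11, 11),
    ("Grafana Cloud free tier", 0, 0),
    ("UptimeRobot Premium", 8, 8),
    ("OpsGenie / Grafana OnCall", 10, 80),
    ("Operational engineering", 100, 150),
]


def totals(items, exclude=()):
    """Sum the low and high columns, skipping any excluded line items."""
    low = sum(lo for name, lo, _ in items if name not in exclude)
    high = sum(hi for name, _, hi in items if name not in exclude)
    return low, high


all_in = totals(BOM)
cash_only = totals(BOM, exclude=("Operational engineering",))
print(all_in)     # (167, 287)
print(cash_only)  # (67, 137)
```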

The Prometheus configuration that works

The minimum-viable prometheus.yml for a 30-host SMB:

yaml

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    environment: prod
    organisation: client-name

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yaml"

scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/hosts.json']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://www.client-name.com
        - https://app.client-name.com
        - https://api.client-name.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

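# Retention is not a prometheus.yml setting in most Prometheus releases.
# Pass it as command-line flags to the prometheus binary instead:
#   --storage.tsdb.retention.time=60d
#   --storage.tsdb.retention.size=150GB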
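
The `node` job reads its targets from `hosts.json`, which uses the file-based service discovery format: a JSON list of groups, each with `targets` and optional `labels`. A minimal sketch of generating that file from an inventory follows. The host names and the `role` label are hypothetical, and port 9100 is the node exporter's default.

```python
import json


def build_file_sd(hosts, port=9100, labels=None):
    """Return a file_sd target list: one group holding every host:port."""
    group = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = dict(labels)
    return [group]


# Hypothetical inventory; replace with the client's real host list.
inventory = ["web-01.internal", "db-01.internal"]
print(json.dumps(build_file_sd(inventory, labels={"role": "linux"}), indent=2))
```

Writing the printed JSON to `/etc/prometheus/targets/hosts.json` is enough; Prometheus watches the file and picks up changes without a restart.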

The Grafana dashboards we ship

The dashboard library — start with these, add as needs surface:

Infrastructure Overview. CPU / memory / disk / network across all hosts. Single screen. Red / amber / green per host.
Application Performance. Per-service request rate, error rate, P95 latency (RED metrics). Sorted by error rate descending.
Database Health. Connection pool usage, slow query count, replication lag, disk space.
Business KPIs. Signups today, active users, orders processed. Pulled from the application via custom metrics endpoint.
Incident Investigation. Aggregates logs + metrics for a defined service over a time window. Used during P1 / P2 response.
SLO Tracker. Error budget burn rate per service. Forecasts when budget will exhaust.
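
The arithmetic behind the SLO Tracker is short. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and the forecast divides what is left of the budget by that rate. A sketch under the assumption of a 30-day SLO window and a constant burn rate (the function names are illustrative):

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'allowed' the error budget is being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def hours_until_exhausted(rate, budget_remaining, window_hours=30 * 24):
    """Hours left at the current burn rate; None if the budget is not burning.

    budget_remaining is the unspent fraction of the budget, from 0.0 to 1.0.
    """
    if rate <= 0:
        return None
    return budget_remaining * window_hours / rate


# Example: a 99.9% SLO with 0.5% of requests currently failing.
rate = burn_rate(error_ratio=0.005, slo_target=0.999)
print(round(rate, 2), hours_until_exhausted(rate, budget_remaining=1.0))
```

A burn rate of 1 spends the budget exactly over the window; anything well above 1 is what the dashboard should surface.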

Every dashboard is JSON-exported and version-controlled in Git. Changes go through PR review. Drift between Git and live is reconciled monthly.
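
The monthly drift check can be sketched as a comparison of two sets of dashboard JSON keyed by `uid`. It assumes the Git and live copies have already been loaded into dictionaries. It also ignores the top-level `id` and `version` fields, which differ between instances and saves without the dashboard itself changing.

```python
import json

# Top-level fields that change without the dashboard itself changing:
# "id" is instance-specific and "version" increments on each save.
VOLATILE_FIELDS = ("id", "version")


def normalise(dashboard):
    """Return a canonical JSON string with volatile top-level fields removed."""
    stable = {k: v for k, v in dashboard.items() if k not in VOLATILE_FIELDS}
    return json.dumps(stable, sort_keys=True)


def drifted(git_dashboards, live_dashboards):
    """Return sorted uids that differ, or exist on only one side."""
    uids = set(git_dashboards) | set(live_dashboards)
    return sorted(
        uid
        for uid in uids
        if uid not in git_dashboards
        or uid not in live_dashboards
        or normalise(git_dashboards[uid]) != normalise(live_dashboards[uid])
    )
```

An empty result means Git and live agree; any listed `uid` is a dashboard to re-export into Git or redeploy from it.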

The alerting rules that matter

The alert library we ship — these cover ~85% of the genuine paging-worthy incidents in a typical SMB:

yaml

groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          runbook: "https://docs.internal/runbooks/host-down"

      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }} ({{ $value | humanize }}%)"

      - alert: DiskFillingUp
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} will fill in <24h"

      - alert: DiskCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk on {{ $labels.instance }} at <5% free"

      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory on {{ $labels.instance }} ({{ $value | humanize }}%)"

  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate >5% on {{ $labels.service }}"

      - alert: HighLatencyP95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m]))
              by (service, le)) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency >2s on {{ $labels.service }}"

  - name: synthetic
    rules:
      - alert: WebsiteDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Public endpoint {{ $labels.instance }} not responding"

The alerts are scoped and named, and each should carry a runbook link, as HostDown does above. Every alert that fires either pages an engineer (critical) or generates a ticket (warning). No alert exists that does not have a documented response.
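
The DiskFillingUp rule relies on `predict_linear`, which fits a straight line through the samples in the range and extrapolates it. The sketch below illustrates that idea with an ordinary least-squares fit. It is a simplified model, not Prometheus's implementation: it extrapolates from the last sample, whereas Prometheus extrapolates from the evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    return its value seconds_ahead past the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


# Hypothetical disk: 10 GB free, losing 1 GB per hour, sampled hourly for 6h.
GB = 1e9
history = [(hour * 3600, 10 * GB - hour * GB) for hour in range(7)]
predicted = predict_linear(history, 24 * 3600)
print(predicted < 0)  # True: the rule's "< 0" condition would hold
```

The `for: 30m` clause on the rule then requires the prediction to stay negative for half an hour before the alert fires, which filters out short write bursts.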

The Alertmanager routing

yaml

route:
  receiver: 'team-default'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - matchers:
        - severity = critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h

    - matchers:
        - severity = warning
      receiver: 'slack-warnings'
      repeat_interval: 12h

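receivers:
  # Fallback for alerts that match no child route. Assumption: it posts to
  # the same Slack webhook; point it wherever unlabelled alerts should land.
  - name: 'team-default'
    slack_configs:
      - api_url: $SLACK_WEBHOOK
        channel: '#alerts'

  # $PAGERDUTY_SERVICE_KEY and $SLACK_WEBHOOK are template placeholders.
  # Alertmanager does not expand environment variables, so substitute them
  # before deploying.
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: $PAGERDUTY_SERVICE_KEY

  - name: 'slack-warnings'
    slack_configs:
      - api_url: $SLACK_WEBHOOK
        channel: '#alerts'
        send_resolved: true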
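
The routing tree is easier to reason about as a function: an alert goes to the first child route whose matchers it satisfies, and otherwise falls back to the top-level receiver. The sketch below models only that behaviour, for equality matchers and a single level of routes. It is an illustration, not Alertmanager's own code, and it ignores options such as `continue`.

```python
def pick_receiver(labels, routes, default_receiver):
    """Return the receiver of the first route whose matchers all equal the
    alert's labels, else the default receiver."""
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return default_receiver


# Mirrors the two child routes in the config above.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
]

print(pick_receiver({"alertname": "HostDown", "severity": "critical"}, ROUTES, "team-default"))
print(pick_receiver({"alertname": "Unlabelled"}, ROUTES, "team-default"))
```

The second call shows why the default receiver matters: an alert rule that forgets its `severity` label lands there, so the default should go somewhere a human will see it.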

Loki for logs

The logs piece. Promtail or Vector agents ship logs to Loki, which indexes them by label. Give the log streams the same labels as the Prometheus targets and Grafana can move between metrics and logs for the same host or service. Queries use LogQL, which is modelled on PromQL.

The trap: shipping every log line at full verbosity. A small estate generates 5-50 GB of logs per day. Loki handles this on the modest VM if configured for retention; without retention policies the disk fills inside a month.

The configuration that works:

Hot tier (last 14 days): stored on the monitoring VM's NVMe. Queryable instantly.
Cold tier (14-365 days): stored on Hetzner Object Storage. Queryable with seconds of latency on cold-tier queries.
Drop list: kernel-message noise, repeating health-check 200s, low-value access-log lines.
Sample list: high-volume but useful logs (web access logs) sampled at 10%.
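
The drop list and sample list amount to a per-line filter. In practice that logic lives in the shipping agent's pipeline configuration; the sketch below only illustrates the decision. The patterns are hypothetical stand-ins, and hashing the line makes the 10% sample deterministic, so the same line always gets the same verdict.

```python
import hashlib
import re

# Hypothetical patterns standing in for the drop list.
DROP_PATTERNS = [
    re.compile(r'"GET /healthz[^"]*" 200'),  # repeating health-check 200s
    re.compile(r"^kernel: "),                # kernel-message noise
]
# Hypothetical pattern for high-volume web access logs.
SAMPLE_PATTERNS = [re.compile(r'"(GET|POST) [^"]*" \d{3}')]
SAMPLE_PERCENT = 10


def keep(line):
    """Return True if the line should be shipped to Loki."""
    if any(p.search(line) for p in DROP_PATTERNS):
        return False
    if any(p.search(line) for p in SAMPLE_PATTERNS):
        digest = hashlib.sha256(line.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < SAMPLE_PERCENT
    return True  # everything else ships at full fidelity
```

Lines that match neither list, such as application errors, always ship; only the access-log volume is thinned.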

The synthetic monitoring piece (do not skip)

The monitoring stack monitors the monitoring stack. If Prometheus itself dies, the alerts you configured do not fire. The defence:

External probe (UptimeRobot, Hetzner Statuspage probe) verifies the monitoring host is reachable
External probe verifies the Alertmanager webhook receives test alerts
Weekly synthetic test: fire a known test-alert; verify it lands in Slack within 60 seconds

Without these defences, you cannot trust that the alerting pipeline is healthy. With them, the meta-monitoring catches silent failures.
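
The weekly synthetic test can be a short script that posts one alert to Alertmanager's v2 API and lets the normal routing deliver it. A minimal sketch follows. The alert name is hypothetical, the base URL assumes the script runs on the monitoring host, and the `warning` severity is chosen so that the routing shown earlier would send it to Slack.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(now=None, ttl_minutes=5):
    """Build the JSON payload: a list holding one short-lived alert."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": "SyntheticPipelineTest", "severity": "warning"},
        "annotations": {"summary": "Weekly alert-pipeline test, safe to ignore"},
        "startsAt": now.isoformat(),
        # The alert resolves itself after ttl_minutes.
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }]


def send_test_alert(base_url="http://localhost:9093"):
    """POST the test alert; returns True if Alertmanager accepted it."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(build_test_alert()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```

Run `send_test_alert()` from cron, then have a person or a second script confirm that the message reached the channel. Acceptance by Alertmanager proves only the first hop of the pipeline.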

The first-day setup script

The minimum-viable bootstrap. Run on a fresh Hetzner CCX13:

bash

#!/bin/bash
# bootstrap-monitoring.sh
set -euo pipefail

# Install Docker
curl -fsSL https://get.docker.com | sh
systemctl enable --now docker

# Pull our pre-configured stack
git clone https://github.com/itsailor/monitoring-starter.git
cd monitoring-starter

# Customise per-client
cp .env.example .env
"${EDITOR:-vi}" .env  # set ORG_NAME, SLACK_WEBHOOK, PAGERDUTY_KEY, ALERT_EMAIL (falls back to vi if EDITOR is unset)

# Pull config templates from our reference repo
./scripts/render-config.sh

# Bring up the stack
docker compose up -d

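# Verify health (retry while the containers finish starting)
curl -fsS --retry 10 --retry-delay 3 --retry-connrefused http://localhost:9090/-/healthy && echo "Prometheus OK"
curl -fsS --retry 10 --retry-delay 3 --retry-connrefused http://localhost:3000/api/health && echo "Grafana OK"
curl -fsS --retry 10 --retry-delay 3 --retry-connrefused http://localhost:9093/-/healthy && echo "Alertmanager OK"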

echo "Stack running. Configure Cloudflare Tunnel for public access:"
echo "  cloudflared tunnel create monitoring-client-name"

From clean VM to running stack: ~30 minutes. The Git repo carries all the configs, dashboards, alert rules. Per-client customisation is environment variables + a hosts.json file listing the targets.

What "production-ready" means at this scale

For an SMB observability stack, "production-ready" requires:

Backup of the monitoring data. Prometheus data on the VM gets backed up nightly to Hetzner Storage Box.
HA at the data plane. Two Prometheus instances scraping the same targets (rather than a single point of failure). Cheap at this scale.
Documentation of the runbooks. Every alert links to a runbook. Runbooks are reviewed quarterly.
External monitoring of the monitoring. As described above.
Versioned config. Every config + dashboard + alert rule in Git, deployed via CI.

What we have learned from running this for clients

The first 30 days is alert tuning. Out-of-the-box rules fire too aggressively or too softly. Tune to your reality. Aim for <3 false-positive pages per week per on-call.
The dashboard library grows by accretion. Engineers ask for specific views. Build them, document them, share them.
The cost stays under €300/month even at 100 hosts. The Prometheus + Loki retention tier is the major variable; for most SMBs the all-in stays small.
The on-call burden is real but modest. Expect 2-5 paging-worthy events per month at this scale. Most are infrastructure (disk filling, host issues) not application incidents.
Migration to SaaS is easy if you grow past the threshold. Prometheus can forward metrics with remote_write to Grafana Cloud or Mimir, and the Datadog Agent can scrape the same exporter endpoints. The stack is portable.

The one paragraph version

A 50-person SMB does not need Datadog. It needs Prometheus + Grafana + Loki + Alertmanager on a small Linux VM for €170-€290/month, with explicit dashboards, named alert rules, runbook-linked routing, and external meta-monitoring. The stack scales to ~150 hosts before TCO tips toward SaaS. The first 30 days is alert tuning; the operational cost is 0.1 FTE. The stack is portable — if you outgrow it, Prometheus exports cleanly to Datadog or Grafana Cloud. Skip the SaaS premium at this scale; spend the difference on engineering capacity.

If you want this designed + deployed + handed over, that is the engagement shape under our Intelligent Workflow Automation service (for the integration with on-call + ticketing) + Azure Cloud Infrastructure for the broader infrastructure landing-zone. The deeper monitoring practice — useful alerts, error-budget SLOs, debt management — is covered in our monitoring-that-doesn't-lie practitioner deep-dive.
