Modern Workspace · Practitioner · 25 April 2026 · 10 min read

The Platform Engineering Playbook for 10-Person IT Teams

Platform engineering at SMB scale — the 15% of the FAANG playbook that produces ROI on a 10-person IT team. Four patterns that work, four anti-patterns, and a reference stack (GitHub + Atlantis + Terraform + Vault + Prometheus) that ships in a quarter.

#Platform Engineering #DevOps #Team

ITSailor

Senior IT Consultant

Platform engineering is what Spotify, Netflix and the FAANG cohort do with 80-person platform teams. For a 10-person IT team in a 200-seat company, the question is not "how do we do what Spotify does" but "which 15% of the playbook applies at our scale, and can we ship it without doubling headcount".

This article is that 15%: the patterns from the platform-engineering literature that produce measurable ROI when scaled down to small teams, the ones that do not, and a working golden-path implementation that fits in a quarter.

What "platform engineering" actually means

Stripped of jargon: platform engineering is the practice of building internal products that other engineers consume. Not "ops as a cost centre" — ops as a product team whose customers are the other engineers. The artefacts:

Golden paths: the well-paved, opinionated way to do a thing (deploy a service, provision a database, set up monitoring)
Self-service surfaces: the engineer requests something via a portal or CLI and gets it within minutes, not days
Documentation as product: the docs are version-controlled, tested, and treated with the same rigour as code
Internal developer portal: the front door — a place to find services, request resources, see what is happening

At Spotify scale, this becomes Backstage with 200 plugins and dedicated platform PMs. At 200-seat-company scale, this becomes a small set of opinionated tools that remove specific friction points.

The four patterns that work at SMB scale

Pattern 01: A golden path for the most common workflow

Pick the single workflow that consumes the most engineering time and standardise it. For most product companies, this is "I want to deploy a new service". For internal IT teams, this is often "I want to provision a new dev environment".

The pattern: one canonical path with explicit opinions. Same Terraform module. Same Dockerfile template. Same CI/CD pipeline. Same observability instrumentation. Engineers who use the golden path get a fully-monitored, security-baselined, deployable service in 90 minutes. Engineers who go off-path can, but they own the consequences.

The non-obvious requirement: the golden path has to be genuinely better than the off-path alternatives. If the team prefers their own approach, the golden path failed at the design stage. The platform team's KPI is golden-path adoption rate.

Pattern 02: Self-service provisioning via a small portal

The "open a Jira ticket and wait" workflow is what platform engineering sets out to eliminate. Replace it with a portal where engineers request resources and get them either automatically (if pre-approved) or after a fast approval (if policy requires one).

For small teams, the portal does not need to be Backstage. Three options that work:

Terraform Cloud / Atlantis + GitHub PR templates. The "portal" is GitHub. Open a PR with a config file; CI plans the change; an approver merges; the change applies. Zero new tooling.
Port.io / Cortex (lightweight IDP). Hosted portal with custom workflows. Lower setup cost than Backstage; sufficient for <25 engineer teams.
Backstage minimal install. Open source; more setup; pays off if you grow.

Start with the first option (GitHub as the portal). Graduate to the second (a lightweight hosted IDP) when GitHub-as-portal becomes the bottleneck.
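As a minimal sketch of the GitHub-as-portal option, a repo-level `atlantis.yaml` for a platform-config repository might look like the following. The project name, directory layout, and file globs are assumptions for illustration. Note also that `apply_requirements` in a repo-level file only takes effect if the Atlantis server-side repo config allows that override.

```yaml
# atlantis.yaml at the root of the (hypothetical) platform-config repo
version: 3
projects:
  - name: platform-config          # assumed project name
    dir: .                         # assumed: Terraform root lives at the repo root
    autoplan:
      enabled: true
      # Re-plan when a service declaration or a Terraform file changes (assumed layout)
      when_modified:
        - "services/*.yaml"
        - "*.tf"
    # Require an approving review and a mergeable PR before `atlantis apply` runs
    apply_requirements: [approved, mergeable]
```

With a setup along these lines, the plan output appears as a PR comment and the approval step doubles as the change-control record.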

Pattern 03: Documentation as a tested product

The wiki rot pattern from our AI-docs article applies double here. Platform engineering documentation has to be honest because engineers will discover lies within hours.

The discipline:

Docs live in Markdown next to the code they describe
Code examples in docs are tested in CI (executable docs)
The "how do I deploy a service" doc is reviewed every quarter against the actual deploy procedure
New engineers go through the docs end-to-end on day 1; gaps surfaced become tickets
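One way to start enforcing this discipline in CI is a docs build that fails on warnings. The sketch below assumes an MkDocs Material site with `mkdocs.yml` at the repository root and a workflow file name of our choosing. It only checks that the site builds cleanly (broken internal links become failures). Executing the code examples embedded in the docs would need an additional step that is not shown here.

```yaml
# .github/workflows/docs.yml (hypothetical file name)
name: docs
on:
  pull_request:
    paths:
      - "docs/**"
      - "mkdocs.yml"
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install mkdocs-material
      # --strict turns warnings (for example broken internal links) into build failures
      - run: mkdocs build --strict
```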

Pattern 04: Internal SLOs for the platform itself

The platform is a service. It needs SLOs. "Deploy pipeline P95 under 12 minutes." "Self-service environment provisioning success rate above 95%." "Documentation page rendering availability above 99.5%."

Without SLOs, the platform's customers (the other engineers) complain about specific incidents but never see the trend. With SLOs, the platform team has explicit targets and the other engineers have explicit expectations.

The four patterns that do not work at SMB scale

Anti-pattern 01: Backstage with 30 plugins

For a 10-person IT team, Backstage with the full plugin ecosystem is a project unto itself. The platform team ends up maintaining Backstage rather than the actual platform. Skip until you grow past 25-30 engineers.

Anti-pattern 02: Custom abstraction over Terraform

Building your own "internal IaC DSL" on top of Terraform sounds appealing. It is a rabbit hole. The abstraction becomes the platform team's full-time product. Use Terraform modules + good documentation instead.

Anti-pattern 03: A dedicated "developer experience" survey programme

At hyperscaler scale, DevEx surveys produce useful signal. At SMB scale, you have 8 engineers; talk to them in the kitchen. The survey overhead exceeds the data value.

Anti-pattern 04: A formal platform-team product backlog with sprints + retros

The hyperscale platform team is a product team. The SMB platform team is also doing infrastructure operations + security + everything else. The formal-product-team overhead conflicts with the operational rhythm. Keep the backlog visible (Linear / Jira / GitHub Projects) but skip the product-team ceremony.

The reference implementation for a 10-person IT team

The stack we ship for clients in this profile:

Layer	Tool	Why
Source control	GitHub Enterprise	Universal; CI/CD; security tooling
CI/CD	GitHub Actions	Lives next to the code; sufficient for SMB workloads
IaC	Terraform + Atlantis	PR-driven workflow with plan in comments
Container orchestration	Hetzner-class K8s or Proxmox VMs	Depending on workload pattern
Secrets management	HashiCorp Vault (Enterprise or self-hosted OSS)	Centralised + audited + integrates with everything
Identity	Entra ID + SCIM	SSO into the whole platform
Observability	Prometheus + Grafana + Loki	Open source; portable; cheap at this scale
Internal portal	GitHub-as-portal initially; Port.io if needed	Avoid Backstage at this scale
Documentation	Markdown in Git + Docusaurus or MkDocs Material	Versioned, reviewed, deployed via CI
On-call	PagerDuty / Opsgenie / Grafana OnCall	Rotation management

The "deploy a new service" golden path

The canonical workflow that drives ~60% of the platform's value at SMB scale:

Engineer's experience

Clone the service-template repository. It contains: Dockerfile, CI/CD workflow, Terraform module reference, observability instrumentation library, README template.
Rename to the new service name. Update the README. Push to a new repository.
CI/CD runs immediately. Builds container image. Pushes to the registry. Validates IaC plan.
Engineer opens a PR against the platform-config repo: a 5-line YAML file declaring the new service.
Atlantis plans the deployment in the PR. Reviewer approves. Atlantis applies.
Service is live. Prometheus discovers it. Grafana dashboard auto-generated. Alerts pre-configured.
Total elapsed time: 60-90 minutes from "I need a service" to "it is running in staging".
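The 5-line declaration in the platform-config PR could look something like this. The file path and every field name here are assumptions for illustration; the real schema is whatever the platform team's Terraform modules are written to consume.

```yaml
# services/invoice-api.yaml (hypothetical path and schema)
name: invoice-api                         # service name, matches the new repository
owner: team-billing                       # owning team, used for alert routing
archetype: http-service                   # selects the Terraform module and dashboard template
environment: staging                      # first deployment target
image: ghcr.io/example-org/invoice-api    # container image built by the service's CI
```

Keeping the declaration this small is the design choice that matters: everything else (networking, monitoring, alerting) comes from the module defaults rather than from the engineer.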

What the platform team did once, that enables this every time

The service-template repository (one-time setup, ongoing maintenance)
The Terraform modules that the YAML declarations consume (one-time per module class)
The Atlantis configuration for the platform-config repo (one-time)
The observability auto-discovery patterns (one-time per environment)
The Grafana dashboard templates (one-time per service archetype)

The work is non-trivial but bounded. Estimate: 6-10 engineering weeks for the initial implementation. Pays back inside 6 months once the team is shipping >1 service per quarter.
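To give a sense of what the service-template repository carries, here is a minimal sketch of its CI workflow, assuming GitHub Actions and GitHub Container Registry as the image registry (the article does not name a registry, so that part is an assumption). It covers only the build-and-push step; the IaC plan validation is not shown.

```yaml
# .github/workflows/ci.yml in the service-template repo (illustrative)
name: ci
on:
  push:
    branches: [main]
  pull_request:
permissions:
  contents: read
  packages: write
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          # Push only from main; pull requests just verify that the image builds
          push: ${{ github.event_name == 'push' }}
          # Assumes a lowercase owner/repo name, which the registry requires
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
```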

The "provision a development environment" golden path

For internal IT contexts, this often matters more than service deployment. The pattern:

yaml

# engineer opens a PR with this file: requests/sarah-laptop-rebuild.yaml
requester: sarah.engineer@client.com
purpose: laptop rebuild after replacement
environment:
  template: developer-laptop-2026  # the standard image
  apps:
    - vscode
    - docker-desktop
    - postman
    - tailscale  # for ZTNA
  vpn_groups: [engineering, internal-services]
  expires: never  # for permanent reassignments

The PR triggers automation: Intune profile assignment, VPN group membership, license assignment, welcome email. Total time from PR-merge to laptop-ready: under 30 minutes for the automation, plus device-shipping time when relevant.
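The trigger side of that automation can be sketched as a workflow that runs when a request file lands on the main branch. The script `scripts/apply-request.sh` is hypothetical; it stands in for whatever performs the profile assignment, group membership, and licensing calls in your environment.

```yaml
# .github/workflows/apply-request.yml (hypothetical file name)
name: apply-environment-request
on:
  push:
    branches: [main]
    paths:
      - "requests/*.yaml"
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2   # fetch the previous commit so the merge can be diffed
      - name: Apply added or modified requests
        run: |
          # List request files added or modified by the merged change
          git diff --name-only --diff-filter=AM HEAD~1 HEAD -- 'requests/*.yaml' |
            while read -r file; do
              ./scripts/apply-request.sh "$file"   # hypothetical script
            done
```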

Documentation as a tested product

The structure we use:

text

docs/
├── tutorials/             # First-time experiences
│   ├── deploy-your-first-service.md
│   ├── set-up-local-dev.md
│   └── debug-a-failing-deploy.md
├── how-to/                # Task-oriented
│   ├── add-a-new-environment.md
│   ├── rotate-credentials.md
│   └── add-a-monitoring-alert.md
├── reference/             # Look-up
│   ├── terraform-modules.md
│   ├── ci-cd-pipeline.md
│   └── slo-definitions.md
├── explanation/           # Why things are the way they are
│   ├── why-we-chose-terraform.md
│   ├── golden-path-philosophy.md
│   └── service-classification.md
└── runbooks/              # Operational
    ├── platform-incident-response.md
    └── disaster-recovery.md

The four-quadrant structure (Diátaxis) maps onto how engineers actually look for documentation. Tutorials for first-time; how-to for task; reference for lookup; explanation for context. Most wiki rot happens because the four categories are mixed and the docs become impossible to navigate.
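If the docs are published with MkDocs Material, the same structure can be made explicit in the site navigation. The following `mkdocs.yml` is a sketch; the site name is an assumption, and the paths are relative to the `docs/` directory shown above.

```yaml
# mkdocs.yml at the repository root
site_name: Platform Docs   # assumed name
theme:
  name: material
nav:
  - Tutorials:
      - tutorials/deploy-your-first-service.md
      - tutorials/set-up-local-dev.md
      - tutorials/debug-a-failing-deploy.md
  - How-to:
      - how-to/add-a-new-environment.md
      - how-to/rotate-credentials.md
      - how-to/add-a-monitoring-alert.md
  - Reference:
      - reference/terraform-modules.md
      - reference/ci-cd-pipeline.md
      - reference/slo-definitions.md
  - Explanation:
      - explanation/why-we-chose-terraform.md
      - explanation/golden-path-philosophy.md
      - explanation/service-classification.md
  - Runbooks:
      - runbooks/platform-incident-response.md
      - runbooks/disaster-recovery.md
```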

The SLOs we typically set

SLO	Target	Window
CI/CD pipeline P95 duration	< 12 min	Rolling 30d
Self-service environment provisioning success	> 95%	Rolling 30d
Documentation site availability	> 99.5%	Quarterly
Platform-team incident response (P2)	< 30 min to acknowledge	Rolling 90d
Onboarding time-to-first-deploy (new engineer)	< 1 business day	Per-engineer
Off-golden-path adoption rate	< 20%	Rolling 90d

The SLOs are published. The platform team's quarterly review walks through SLO compliance. SLO breaches produce action items the team commits to.
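Two of these SLOs translate fairly directly into Prometheus alerting rules. The sketch below assumes the CI system exports a duration histogram and the provisioning automation exports a counter with a `result` label; both metric names are hypothetical. It also assumes Prometheus retains at least 30 days of data, otherwise the window has to shrink.

```yaml
# slo-rules.yaml, loaded through `rule_files` in prometheus.yml
groups:
  - name: platform-slos
    rules:
      - alert: PipelineP95AboveSLO
        # Hypothetical histogram metric, in seconds; 720 s = 12 minutes
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(ci_pipeline_duration_seconds_bucket[30d]))
          ) > 720
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: "CI/CD pipeline P95 is above the 12-minute SLO (rolling 30d)"
      - alert: ProvisioningSuccessBelowSLO
        # Hypothetical counter; success ratio over the rolling 30-day window
        expr: |
          sum(increase(provisioning_requests_total{result="success"}[30d]))
            /
          sum(increase(provisioning_requests_total[30d]))
          < 0.95
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: "Self-service provisioning success rate is below 95% (rolling 30d)"
```

Routing these as tickets rather than pages fits the quarterly-review rhythm: an SLO breach is an action item, not a 3 a.m. incident.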

The on-call piece

A platform team needs on-call. Three-person rotation minimum (with one-week shifts, each person then has 2 weeks off between shifts). The on-call covers:

Platform availability (the CI/CD, the IDP, the observability stack)
Critical infrastructure (the Kubernetes cluster, the Vault, the network backbone)
Customer impact (a customer-facing service is degraded and the platform team is the escalation path)

The on-call does not cover individual application teams' incidents. Those teams own their on-call for their services. The platform team is responsible for the platform; the application team is responsible for the application.

The team composition reality

For a 200-seat company, the platform team is typically:

1 senior platform engineer (lead)
2 platform engineers (build + operate)
0.5 FTE security input (rotated from a security engineer)
0.25 FTE product input (rotated from an engineering manager)

That is 3.75 FTE for the platform function in an organisation of 30-50 engineers, or about 7.5-12.5% of engineering headcount. The ratio is the cost of doing platform engineering at this scale. It is cheaper than the alternative (every team rebuilds the same patterns and reinvents the same mistakes).

Measuring ROI

The three metrics we have found correlate with real productivity improvement:

Lead time to deploy a new service. Median across the team. Pre-platform: 3-10 business days. Post-platform: 90 minutes - 4 hours. The number is concrete; the management ROI conversation is straightforward.
Off-golden-path incidents per quarter. Services deployed without using the golden path produce more incidents because they skip the bundled observability + security + deployment controls. Tracking this directly proves the golden path's value.
New-engineer time-to-first-deploy. Pre-platform: 1-3 weeks. Post-platform: under 1 business day. New-engineer productivity ramps faster; the recruiting case improves.

The failure modes we have seen

Platform team becomes a bottleneck. Every change goes through the platform team. The team is the slowest path. The fix is more self-service, not more platform engineers.
Golden path adoption stays low. Engineers ignore the golden path and reinvent. The fix is to interview the engineers and understand why, because the fault lies in the path itself.
Documentation rots fast. The team writes docs once and never updates. The fix is treating docs as part of every PR review.
Platform team chases hyperscaler patterns. Backstage, complex DevEx tooling, custom abstractions. The fix is ruthless prioritisation against measured ROI.
Application teams resist self-service. "But we want IT to do this for us." The fix is making the self-service genuinely faster than the ticket route. If the ticket is faster, the self-service is wrong.

The one paragraph version

Platform engineering at SMB scale is not "doing what Spotify does with fewer people". It is choosing the 15% of the playbook that produces ROI at your scale. Four patterns work: a golden path for the most common workflow, self-service via a small portal (GitHub-as-portal first, a hosted IDP or Backstage later), documentation as a tested product, internal SLOs for the platform. Four patterns do not: Backstage with 30 plugins, custom IaC DSLs, formal DevEx surveys, product-team ceremony. Reference stack: GitHub + Actions + Terraform + Atlantis + Vault + Prometheus/Grafana/Loki + Markdown docs. Team size: 3-4 FTE for a 30-50 engineer total. Measured impact: deploy lead time drops from 3-10 business days to between 90 minutes and 4 hours, new-engineer time-to-first-deploy falls from 1-3 weeks to under 1 business day, and off-golden-path incidents fall.

If you want this designed + implemented + handed over to your operating team, that is the engagement shape under our Intelligent Workflow Automation + Azure Cloud Infrastructure services. The shape is bespoke per client; the patterns above are the starting point.