The Platform Engineering Playbook for 10-Person IT Teams
Platform engineering at SMB scale — the 15% of the FAANG playbook that produces ROI on a 10-person IT team. Four patterns that work, four anti-patterns, and a reference stack (GitHub + Atlantis + Terraform + Vault + Prometheus) that ships in a quarter.
Platform engineering is what Spotify, Netflix and the FAANG-scale cohort do with 80-person platform teams. For a 10-person IT team in a 200-seat company, the question is not "how do we do what Spotify does" but "which 15% of the playbook applies at our scale, and can we ship it without doubling headcount".
This article is that 15%: the patterns from the platform-engineering literature that produce measurable ROI when scaled down to small teams, the ones that do not, and a working golden-path implementation that fits in a quarter.
What "platform engineering" actually means
Stripped of jargon: platform engineering is the practice of building internal products that other engineers consume. Not "ops as a cost centre" — ops as a product team whose customers are the other engineers. The artefacts:
- Golden paths: the well-paved, opinionated way to do a thing (deploy a service, provision a database, set up monitoring)
- Self-service surfaces: the engineer requests something via a portal or CLI and gets it within minutes, not days
- Documentation as product: the docs are version-controlled, tested, and treated with the same rigour as code
- Internal developer portal: the front door — a place to find services, request resources, see what is happening
At Spotify scale, this becomes Backstage with 200 plugins and dedicated platform PMs. At 200-seat-company scale, this becomes a small set of opinionated tools that remove specific friction points.
The four patterns that work at SMB scale
Pattern 01: A golden path for the most common workflow
Pick the single workflow that consumes the most engineering time and standardise it. For most product companies, this is "I want to deploy a new service". For internal IT teams, this is often "I want to provision a new dev environment".
The pattern: one canonical path with explicit opinions. Same Terraform module. Same Dockerfile template. Same CI/CD pipeline. Same observability instrumentation. Engineers who use the golden path get a fully-monitored, security-baselined, deployable service in 90 minutes. Engineers who go off-path can, but they own the consequences.
The non-obvious requirement: the golden path has to be genuinely better than the off-path alternatives. If the team prefers their own approach, the golden path failed at the design stage. The platform team's KPI is golden-path adoption rate.
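Adoption rate is simple to compute once there is a service inventory. A minimal sketch, assuming a hypothetical `on_golden_path` flag per service (for example, set when a repository is created from the template); the service names are placeholders:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    on_golden_path: bool  # hypothetical flag, not a field from any real tool


def adoption_rate(services: list[Service]) -> float:
    """Return the share of services on the golden path, from 0.0 to 1.0."""
    if not services:
        return 0.0
    on_path = sum(1 for s in services if s.on_golden_path)
    return on_path / len(services)


# Placeholder inventory for illustration only.
inventory = [
    Service("billing-api", True),
    Service("report-worker", True),
    Service("legacy-sync", False),
    Service("notifier", True),
]

print(f"Golden-path adoption: {adoption_rate(inventory):.0%}")
```

Tracked per quarter, the same number shows whether the path is winning engineers over or losing them.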
Pattern 02: Self-service provisioning via a small portal
The "open a Jira ticket and wait" workflow is what platform engineering sets out to replace. Replace it with a portal where engineers request resources and either get them automatically (if pre-approved) or via fast approval (if policy requires).
For small teams, the portal does not need to be Backstage. Three patterns that work:
- Terraform Cloud / Atlantis + GitHub PR templates. The "portal" is GitHub. Open a PR with a config file; CI plans the change; an approver merges; the change applies. Zero new tooling.
- Port.io / Cortex (lightweight IDP). Hosted portal with custom workflows. Lower setup cost than Backstage; sufficient for <25 engineer teams.
- Backstage minimal install. Open source; more setup; pays off if you grow.
Start with the first option (GitHub as the portal). Graduate to the second when GitHub-as-portal becomes the bottleneck.
Pattern 03: Documentation as a tested product
The wiki rot pattern from our AI-docs article applies doubly here. Platform engineering documentation has to be honest because engineers will discover lies within hours.
The discipline:
- Docs live in Markdown next to the code they describe
- Code examples in docs are tested in CI (executable docs)
- The "how do I deploy a service" doc is reviewed every quarter against the actual deploy procedure
- New engineers go through the docs end-to-end on day 1; gaps surfaced become tickets
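"Executable docs" can start very small. The sketch below, which is one possible approach rather than a specific tool, pulls every Python-tagged fenced block out of a Markdown string and runs it, so a broken example fails the build. The function names and the sample document are illustrative assumptions:

```python
import re

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block

# Matches fenced blocks tagged as python and captures their body.
PYTHON_BLOCK = re.compile(
    rf"^{FENCE}python\n(.*?)^{FENCE}\s*$", re.DOTALL | re.MULTILINE
)


def extract_python_blocks(markdown: str) -> list[str]:
    """Return the source of every python-tagged fenced block in a Markdown string."""
    return PYTHON_BLOCK.findall(markdown)


def run_doc_examples(markdown: str) -> int:
    """Execute each block in a fresh namespace; any exception fails the check."""
    blocks = extract_python_blocks(markdown)
    for source in blocks:
        exec(compile(source, "<doc-example>", "exec"), {})
    return len(blocks)


# Stand-in for the contents of a docs page.
sample_doc = "\n".join([
    "# How to compute a timeout",
    "",
    FENCE + "python",
    "timeout_seconds = 12 * 60",
    "assert timeout_seconds == 720",
    FENCE,
    "",
    "Prose between examples is ignored.",
])

print(run_doc_examples(sample_doc), "example(s) passed")
```

In CI, the same function would be called on each changed file under the docs directory. Shell or Terraform examples need a different runner, which this sketch does not cover.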
Pattern 04: Internal SLOs for the platform itself
The platform is a service. It needs SLOs. "Deploy pipeline P95 under 12 minutes." "Self-service environment provisioning success rate above 95%." "Documentation page rendering availability above 99.5%."
Without SLOs, the platform's customers (the other engineers) complain about specific incidents but never see the trend. With SLOs, the platform team has explicit targets and the other engineers have explicit expectations.
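A P95 target is easy to misread as an average, so it helps to see the arithmetic. A minimal sketch using the nearest-rank percentile definition (one of several common definitions); the durations are illustrative, not measured data:

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


P95_TARGET_MINUTES = 12  # the pipeline SLO quoted in the text

# Illustrative pipeline durations in minutes.
durations = [6.5, 7.0, 7.2, 8.1, 8.4, 8.8, 9.0, 9.3, 9.9, 10.2,
             10.4, 10.9, 11.0, 11.1, 11.3, 11.5, 11.6, 11.8, 13.5, 19.0]

p95 = percentile_nearest_rank(durations, 95)
print(f"P95 = {p95} min, SLO met: {p95 < P95_TARGET_MINUTES}")
```

In this sample 18 of 20 runs finish under 12 minutes, yet the P95 is 13.5 minutes and the SLO is missed. That is the trend an incident-by-incident view hides.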
The four patterns that do not work at SMB scale
Anti-pattern 01: Backstage with 30 plugins
For a 10-person IT team, Backstage with the full plugin ecosystem is a project unto itself. The platform team ends up maintaining Backstage rather than the actual platform. Skip until you grow past 25-30 engineers.
Anti-pattern 02: Custom abstraction over Terraform
Building your own "internal IaC DSL" on top of Terraform sounds appealing. It is a rabbit hole. The abstraction becomes the platform team's full-time product. Use Terraform modules + good documentation instead.
Anti-pattern 03: A dedicated "developer experience" survey programme
At hyperscaler scale, DevEx surveys produce useful signal. At SMB scale, you have 8 engineers; talk to them in the kitchen. The survey overhead exceeds the data value.
Anti-pattern 04: A formal platform-team product backlog with sprints + retros
The hyperscale platform team is a product team. The SMB platform team is also doing infrastructure operations + security + everything else. The formal-product-team overhead conflicts with the operational rhythm. Keep the backlog visible (Linear / Jira / GitHub Projects) but skip the product-team ceremony.
The reference implementation for a 10-person IT team
The stack we ship for clients in this profile:
| Layer | Tool | Why |
|---|---|---|
| Source control | GitHub Enterprise | Universal; CI/CD; security tooling |
| CI/CD | GitHub Actions | Lives next to the code; sufficient for SMB workloads |
| IaC | Terraform + Atlantis | PR-driven workflow with plan in comments |
| Container orchestration | Hetzner-class K8s or Proxmox VMs | Depending on workload pattern |
| Secrets management | HashiCorp Vault (Enterprise or self-hosted OSS) | Centralised + audited + integrates with everything |
| Identity | Entra ID + SCIM | SSO into the whole platform |
| Observability | Prometheus + Grafana + Loki | Open source; portable; cheap at this scale |
| Internal portal | GitHub-as-portal initially; Port.io if needed | Avoid Backstage at this scale |
| Documentation | Markdown in Git + Docusaurus or MkDocs Material | Versioned, reviewed, deployed via CI |
| On-call | PagerDuty / Opsgenie / Grafana OnCall | Rotation management |
The "deploy a new service" golden path
The canonical workflow that drives ~60% of the platform's value at SMB scale:
Engineer's experience
- Clone the `service-template` repository. It contains: Dockerfile, CI/CD workflow, Terraform module reference, observability instrumentation library, README template.
- Rename to the new service name. Update the README. Push to a new repository.
- CI/CD runs immediately. Builds container image. Pushes to the registry. Validates IaC plan.
- Engineer opens a PR against the platform-config repo: a 5-line YAML file declaring the new service.
- Atlantis plans the deployment in the PR. Reviewer approves. Atlantis applies.
- Service is live. Prometheus discovers it. Grafana dashboard auto-generated. Alerts pre-configured.
- Total elapsed time: 60-90 minutes from "I need a service" to "it is running in staging".
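The 5-line declaration can stay short because the platform fills in everything else. The sketch below illustrates that idea in Python rather than in the Terraform modules themselves; the field names (`name`, `owner`, `image`, `replicas`) and default values are hypothetical, since the article does not specify the declaration schema:

```python
from copy import deepcopy

# Hypothetical golden-path defaults; real values would live in the platform's Terraform modules.
GOLDEN_PATH_DEFAULTS = {
    "replicas": 2,
    "environment": "staging",
    "monitoring": {"dashboard": True, "alerts": True},
}

REQUIRED_FIELDS = ("name", "owner", "image")


def expand_declaration(declaration: dict) -> dict:
    """Validate a short service declaration and fill in golden-path defaults."""
    missing = [f for f in REQUIRED_FIELDS if f not in declaration]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    expanded = deepcopy(GOLDEN_PATH_DEFAULTS)
    expanded.update(declaration)  # explicit top-level values win over defaults
    return expanded


# Stand-in for the parsed YAML file from the PR.
declaration = {
    "name": "invoice-export",
    "owner": "team-finance",
    "image": "registry.example.com/invoice-export:1.0.0",
    "replicas": 3,
}

print(expand_declaration(declaration))
```

The engineer states only what is specific to the service; monitoring and environment come from the path. A check like this can also run in CI so that an incomplete declaration fails before a reviewer looks at the plan.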
What the platform team did once, that enables this every time
- The `service-template` repository (one-time setup, ongoing maintenance)
- The Terraform modules that the YAML declarations consume (one-time per module class)
- The Atlantis configuration for the platform-config repo (one-time)
- The observability auto-discovery patterns (one-time per environment)
- The Grafana dashboard templates (one-time per service archetype)
The work is non-trivial but bounded. Estimate: 6-10 engineering weeks for the initial implementation. Pays back inside 6 months once the team is shipping >1 service per quarter.
The "provision a development environment" golden path
For internal IT contexts, this often matters more than service deployment. The pattern:
# engineer opens a PR with this file: requests/sarah-laptop-rebuild.yaml
requester: sarah.engineer@client.com
purpose: laptop rebuild after replacement
environment:
  template: developer-laptop-2026  # the standard image
  apps:
    - vscode
    - docker-desktop
    - postman
    - tailscale  # for ZTNA
  vpn_groups: [engineering, internal-services]
expires: never  # for permanent reassignments
The PR triggers automation: Intune profile assignment, VPN group membership, license assignment, welcome email. Total time from PR-merge to laptop-ready: under 30 minutes for the automation, plus device-shipping time when relevant.
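Before the automation runs, something has to decide whether the request is pre-approved or needs a human. A minimal sketch of that check, assuming `template`, `apps` and `vpn_groups` nest under `environment` and that the allow-lists below exist; both the nesting and the policy are assumptions, not details from the article:

```python
# Hypothetical allow-lists; a real policy would live in the platform-config repo.
PRE_APPROVED = {
    "template": {"developer-laptop-2026"},
    "apps": {"vscode", "docker-desktop", "postman", "tailscale"},
    "vpn_groups": {"engineering", "internal-services"},
}


def review_reasons(request: dict) -> list[str]:
    """Return the reasons a request needs human approval; empty means auto-approve."""
    env = request.get("environment") or {}
    reasons = []
    if env.get("template") not in PRE_APPROVED["template"]:
        reasons.append(f"template not pre-approved: {env.get('template')}")
    for key in ("apps", "vpn_groups"):
        for item in env.get(key, []):
            if item not in PRE_APPROVED[key]:
                reasons.append(f"{key} entry not pre-approved: {item}")
    return reasons


# The request from the text, as it would look after YAML parsing.
request = {
    "requester": "sarah.engineer@client.com",
    "purpose": "laptop rebuild after replacement",
    "environment": {
        "template": "developer-laptop-2026",
        "apps": ["vscode", "docker-desktop", "postman", "tailscale"],
        "vpn_groups": ["engineering", "internal-services"],
    },
    "expires": "never",
}

reasons = review_reasons(request)
print("auto-approve" if not reasons else f"needs review: {reasons}")
```

Returning reasons rather than a boolean means the CI comment on the PR can tell the requester exactly what triggered a manual review.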
Documentation as a tested product
The structure we use:
docs/
├── tutorials/                # First-time experiences
│   ├── deploy-your-first-service.md
│   ├── set-up-local-dev.md
│   └── debug-a-failing-deploy.md
├── how-to/                   # Task-oriented
│   ├── add-a-new-environment.md
│   ├── rotate-credentials.md
│   └── add-a-monitoring-alert.md
├── reference/                # Look-up
│   ├── terraform-modules.md
│   ├── ci-cd-pipeline.md
│   └── slo-definitions.md
├── explanation/              # Why things are the way they are
│   ├── why-we-chose-terraform.md
│   ├── golden-path-philosophy.md
│   └── service-classification.md
└── runbooks/                 # Operational
    ├── platform-incident-response.md
    └── disaster-recovery.md
The first four directories follow the four-quadrant Diátaxis structure, which maps onto how engineers actually look for documentation: tutorials for first-time learning, how-to guides for tasks, reference for lookup, explanation for context. Runbooks sit alongside as a fifth, operational category. Most wiki rot happens because the categories are mixed and the docs become impossible to navigate.
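The structure only holds if something enforces it. One lightweight option is a CI check that flags any Markdown page outside the agreed sections. A sketch under the assumption that CI can supply the list of changed paths; the function name and sample paths are illustrative:

```python
from pathlib import PurePosixPath

# Top-level sections from the docs tree in the text.
ALLOWED_SECTIONS = {"tutorials", "how-to", "reference", "explanation", "runbooks"}


def misplaced_docs(paths: list[str]) -> list[str]:
    """Return Markdown files under docs/ that are not inside an allowed section."""
    misplaced = []
    for raw in paths:
        parts = PurePosixPath(raw).parts
        if not parts or parts[0] != "docs" or not raw.endswith(".md"):
            continue  # not a docs page, out of scope for this check
        if len(parts) < 3 or parts[1] not in ALLOWED_SECTIONS:
            misplaced.append(raw)
    return misplaced


# Illustrative list of files changed in a PR.
changed = [
    "docs/how-to/rotate-credentials.md",
    "docs/misc-notes.md",
    "docs/guides/vault-setup.md",
    "src/main.py",
]

print(misplaced_docs(changed))
```

Note that this strict version also flags a top-level page such as an index; a real check would probably allow-list that.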
The SLOs we typically set
| SLO | Target | Window |
|---|---|---|
| CI/CD pipeline P95 duration | < 12 min | Rolling 30d |
| Self-service environment provisioning success | > 95% | Rolling 30d |
| Documentation site availability | > 99.5% | Quarterly |
| Platform-team incident response (P2) | < 30 min to acknowledge | Rolling 90d |
| Onboarding time-to-first-deploy (new engineer) | < 1 business day | Per-engineer |
| Share of services deployed off the golden path | < 20% | Rolling 90d |
The SLOs are published. The platform team's quarterly review walks through SLO compliance. SLO breaches produce action items the team commits to.
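For the quarterly review, a success-rate SLO is easier to discuss as an error budget: how many failures the target allows in the window, and how much of that allowance is spent. This is a common SRE framing rather than something the table above prescribes. A minimal sketch with illustrative numbers:

```python
def error_budget_remaining(total: int, failed: int, target: float) -> float:
    """Fraction of the window's error budget still unspent; below zero means the SLO is breached."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1, exclusive")
    allowed_failures = (1 - target) * total
    return 1 - failed / allowed_failures


# Illustrative numbers for one rolling 30-day window, not measured data:
# 200 provisioning requests, 6 failures, 95% success target.
remaining = error_budget_remaining(total=200, failed=6, target=0.95)
print(f"Error budget remaining: {remaining:.0%}")
```

A budget that is nearly spent mid-window is a prompt for action before the breach, not after it.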
The on-call piece
A platform team needs on-call. Three-person rotation minimum (with one-week shifts, each person then has 2 weeks off between shifts). The on-call covers:
- Platform availability (the CI/CD, the IDP, the observability stack)
- Critical infrastructure (the Kubernetes cluster, the Vault, the network backbone)
- Customer impact (a customer-facing service is degraded and the platform team is the escalation path)
The on-call does not cover individual application teams' incidents. Those teams own their on-call for their services. The platform team is responsible for the platform; the application team is responsible for the application.
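The rotation arithmetic is worth making explicit. A small sketch, assuming one-week shifts and simple round-robin assignment; the names and start date are placeholders, and a real rotation would live in the on-call tool:

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one engineer per week, round-robin, starting from `start`."""
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


def weeks_off_between_shifts(team_size: int) -> int:
    """With one-week shifts, each person rests for team_size - 1 weeks."""
    return team_size - 1


team = ["ana", "ben", "chris"]  # placeholder names
for week_start, engineer in weekly_rotation(team, date(2026, 1, 5), 6):
    print(week_start.isoformat(), engineer)

print("Weeks off between shifts:", weeks_off_between_shifts(len(team)))
```

With two people the rest period drops to a single week, which is why three is the practical floor.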
The team composition reality
For a 200-seat company, the platform team is typically:
- 1 senior platform engineer (lead)
- 2 platform engineers (build + operate)
- 0.5 FTE security input (rotated from a security engineer)
- 0.25 FTE product input (rotated from an engineering manager)
That is roughly 3.75 FTE for the platform function in an organisation of 30-50 engineers. That ratio is the cost of doing platform engineering at this scale, and it is cheaper than the alternative, in which every team rebuilds the same patterns and repeats the same mistakes.
Measuring ROI
The three metrics we have found correlate with real productivity improvement:
- Lead time to deploy a new service. Median across the team. Pre-platform: 3-10 business days. Post-platform: 90 minutes - 4 hours. The number is concrete; the management ROI conversation is straightforward.
- Off-golden-path incidents per quarter. Services deployed without using the golden path produce more incidents because they skip the bundled observability + security + deployment controls. Tracking this against golden-path incidents gives direct evidence of the golden path's value.
- New-engineer time-to-first-deploy. Pre-platform: 1-3 weeks. Post-platform: under 1 business day. New-engineer productivity ramps faster; the recruiting case improves.
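The lead-time metric is a median comparison, which a few lines can compute from deploy records. A sketch with illustrative samples picked from inside the ranges quoted above, assuming an 8-hour business day to convert days to hours; none of this is measured data:

```python
from statistics import median

WORK_HOURS_PER_DAY = 8  # assumption used to convert business days to hours


def lead_time_reduction(before_hours: list[float], after_hours: list[float]) -> float:
    """Relative drop in median lead time, e.g. 0.9 means a 90% reduction."""
    before = median(before_hours)
    after = median(after_hours)
    if before <= 0:
        raise ValueError("median lead time before must be positive")
    return 1 - after / before


# Illustrative samples only.
before = [d * WORK_HOURS_PER_DAY for d in (3, 5, 6, 8, 10)]  # business days -> hours
after = [1.5, 2.0, 2.5, 3.0, 4.0]  # hours

print(f"Median before: {median(before)} h, after: {median(after)} h")
print(f"Reduction: {lead_time_reduction(before, after):.0%}")
```

Using the median rather than the mean keeps one pathological deploy from distorting the number shown to management.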
The failure modes we have seen
- Platform team becomes a bottleneck. Every change goes through the platform team. The team is the slowest path. The fix is more self-service, not more platform engineers.
- Golden path adoption stays low. Engineers ignore the golden path and reinvent. The fix is to interview those engineers and understand why; usually the path itself is wrong.
- Documentation rots fast. The team writes docs once and never updates. The fix is treating docs as part of every PR review.
- Platform team chases hyperscaler patterns. Backstage, complex DevEx tooling, custom abstractions. The fix is ruthless prioritisation against measured ROI.
- Application teams resist self-service. "But we want IT to do this for us." The fix is making the self-service genuinely faster than the ticket route. If the ticket is faster, the self-service is wrong.
The one paragraph version
Platform engineering at SMB scale is not "doing what Spotify does with fewer people". It is choosing the 15% of the playbook that produces ROI at your scale. Four patterns work: a golden path for the most common workflow, self-service via a small portal (GitHub-as-portal first, a lightweight IDP later), documentation as a tested product, internal SLOs for the platform. Four patterns do not: Backstage with 30 plugins, custom IaC DSLs, formal DevEx surveys, product-team ceremony. Reference stack: GitHub + Actions + Terraform + Atlantis + Vault + Prometheus stack + Markdown docs. Team size: 3-4 FTE for an organisation of 30-50 engineers. Measured impact: deploy lead time drops from days to hours, new engineers reach their first deploy in under a business day, and off-golden-path incidents fall.
If you want this designed + implemented + handed over to your operating team, that is the engagement shape under our Intelligent Workflow Automation + Azure Cloud Infrastructure services. The shape is bespoke per client; the patterns above are the starting point.