Agentic AI in IT Operations: Real Use Cases Beyond the Hype
Eight production use cases where agents earn their keep, four where they reliably disappoint, and the five-component safety architecture that separates "useful tool" from "incident in production".
"Agentic AI" is the most over-loaded phrase in enterprise software right now. Vendors use it to mean anything from a system prompt with the word "agent" in it, to a fully autonomous workflow that books flights and signs contracts. This is the field report on where agents actually pay off in IT operations — and where they reliably fail.
We have shipped agentic AI into four production estates in the last twelve months. Internal IT helpdesk for a 600-seat scale-up. Operational triage for a maritime logistics operator. Onboarding automation for a Maltese FinTech. Cross-system request handling for a corporate-services firm. The pattern that emerges is consistent: agents work brilliantly inside narrow, well-defined action surfaces, and fail visibly the moment the operating envelope gets stretched.
This article maps the territory. Eight production use cases where agents earn their keep, four where they reliably disappoint, and the safety architecture that makes the line between "useful tool" and "incident in production" survivable.
The mental model: what an agent actually is
Strip the marketing and an agent is three components:
- A model (LLM) that decides what to do next given the current state.
- A tool set — APIs the model can call to read state or take action.
- A loop — the runtime that lets the model take multiple steps before producing the final output.
That's it. The variations are how rich the tool set is (one API call vs forty), how disciplined the loop is (fixed number of steps vs open-ended), and how strong the safety boundary is (read-only vs unrestricted action). Every "agent" you encounter sits somewhere in that three-axis space. Understand which corner you are in, and the failure modes become predictable.
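As a rough illustration of those three components, here is a minimal Python sketch. The model is stubbed as a plain function, and every name in it (`Step`, `TOOLS`, `fake_model`, `run_agent`, the `lookup_ticket` tool) is hypothetical, chosen for illustration rather than taken from any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    """What the model decided: call a tool, or finish with an answer."""
    tool: Optional[str] = None
    argument: str = ""
    final_answer: Optional[str] = None

# Tool set: the only functions the loop will call on the model's behalf.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_ticket": lambda ticket_id: f"ticket {ticket_id}: VPN client will not connect",
}

def fake_model(history: list[str]) -> Step:
    """Stand-in for an LLM call: read the ticket first, then answer."""
    if not any(line.startswith("lookup_ticket") for line in history):
        return Step(tool="lookup_ticket", argument="T-1")
    return Step(final_answer="Route to the network queue")

def run_agent(model: Callable[[list[str]], Step], task: str, max_steps: int = 5) -> str:
    """The loop: ask the model, run the chosen tool, feed the result back, repeat."""
    history = [task]
    for _ in range(max_steps):
        step = model(history)
        if step.final_answer is not None:
            return step.final_answer
        if step.tool not in TOOLS:
            return "escalate: model requested an unknown tool"
        result = TOOLS[step.tool](step.argument)
        history.append(f"{step.tool} -> {result}")
    return "escalate: step budget exhausted"
```

The three axes map onto the sketch directly: the size of `TOOLS` is the richness of the tool set, `max_steps` is the discipline of the loop, and what the tool functions are allowed to do is the safety boundary.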
Where agents actually work in IT operations
Use case 01: Ticket triage and classification
What it does: An incoming ticket arrives in ServiceNow / Jira / Zendesk. The agent reads the title and body, classifies it into queue + priority + tags, optionally drafts a first response, and assigns it according to the routing rules.
Why it works: The decision space is narrow (pick from a known list of queues), the consequence of error is small (a misrouted ticket gets re-routed by a human in five minutes), the volume is high (hundreds per day), and the LLM is genuinely good at the underlying task (classification of natural language).
Production results, maritime client: 87% correct queue assignment on first pass (vs 71% with the previous rules-based router). Median time-to-first-response down from 4h to 18 minutes. Operator load on triage reduced by ~60%.
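One way to keep the decision space narrow in practice is to validate the model's classification against the known lists before anything is routed. The sketch below assumes the model is asked to reply in JSON; the queue names, priority labels and the `human-triage` fallback queue are hypothetical placeholders.

```python
import json

# Hypothetical lists; substitute the queues and priorities your ITSM tool defines.
KNOWN_QUEUES = {"network", "identity", "hardware", "software"}
KNOWN_PRIORITIES = {"P1", "P2", "P3", "P4"}

def fallback() -> dict:
    """Anything the validator cannot trust goes to a human triage queue."""
    return {"queue": "human-triage", "priority": "P3", "tags": []}

def parse_triage(raw_model_output: str) -> dict:
    """Accept the model's classification only if it stays inside the known lists."""
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return fallback()
    if not isinstance(data, dict):
        return fallback()
    queue, priority, tags = data.get("queue"), data.get("priority"), data.get("tags")
    if not isinstance(queue, str) or queue not in KNOWN_QUEUES:
        return fallback()
    if not isinstance(priority, str) or priority not in KNOWN_PRIORITIES:
        return fallback()
    # Keep only string tags; a missing or malformed tag list becomes empty.
    clean_tags = [t for t in tags if isinstance(t, str)] if isinstance(tags, list) else []
    return {"queue": queue, "priority": priority, "tags": clean_tags}
```

A misclassification inside the list is still possible, but an invented queue name never reaches the router.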
Use case 02: Password reset and access requests
What it does: User asks for a password reset via Teams / Slack / email. The agent verifies identity through your IdP, checks the user's role for the requested resource, and either executes the reset (low-risk paths) or routes to a human (elevated paths).
Why it works: The action whitelist is small and well-defined. Identity verification has a clean API. The cost of getting it wrong is bounded (worst case: a delayed reset). The payoff from getting it right is high (senior engineer time freed from repetitive tickets).
Production guard: Password reset is auto-executed only for the user's own account, only after MFA reverification, and only within the privilege threshold we configure per client. Reset for an admin account, an executive, or above-threshold privileges always routes to a human approver.
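The production guard reads naturally as a small deterministic function that sits in front of the reset tool. This is a sketch under assumptions: the field names, the integer privilege level and the two return values are hypothetical, and a real deployment would populate the request from the IdP rather than by hand.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResetRequest:
    requester: str
    target_account: str
    mfa_reverified: bool
    target_is_admin: bool
    target_is_executive: bool
    target_privilege_level: int  # hypothetical numeric scale, higher = more privileged

def reset_decision(req: ResetRequest, privilege_threshold: int) -> str:
    """Return 'auto_execute' only when every condition of the guard holds."""
    if req.requester != req.target_account:
        return "route_to_human"      # resets for someone else's account
    if not req.mfa_reverified:
        return "route_to_human"      # identity not freshly confirmed
    if req.target_is_admin or req.target_is_executive:
        return "route_to_human"      # always needs a human approver
    if req.target_privilege_level > privilege_threshold:
        return "route_to_human"      # above the per-client threshold
    return "auto_execute"
```

The model never evaluates these conditions itself; it only sees the outcome.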
Use case 03: Configuration drift detection and remediation
What it does: An agent runs on a schedule against your infrastructure (AWS, Azure, GCP) and identifies drift against the documented baseline (Terraform state, CIS benchmarks, your internal SOPs). It surfaces the drift, classifies severity, and, for low-risk drift, prepares a Terraform PR to remediate.
Why it works: The state is verifiable through APIs. The action (proposing a PR) is reviewable by a human. If the agent is wrong, the cost is small: an engineer reviews a PR that turns out to be unnecessary. If it is right, the payoff is meaningful: drift is caught before it bites in production.
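The detection half of this use case is deterministic and needs no model at all. A minimal sketch, assuming the baseline and the observed state have already been flattened into dictionaries; the setting names and the `HIGH_RISK_SETTINGS` list are hypothetical.

```python
# Hypothetical setting names; a real baseline would come from Terraform state
# or a benchmark export.
HIGH_RISK_SETTINGS = {"public_access", "encryption_at_rest"}

def find_drift(baseline: dict, observed: dict) -> list[dict]:
    """Compare observed settings with the baseline and classify each difference."""
    findings = []
    for setting in sorted(baseline.keys() | observed.keys()):
        expected, actual = baseline.get(setting), observed.get(setting)
        if expected == actual:
            continue
        high_risk = setting in HIGH_RISK_SETTINGS
        findings.append({
            "setting": setting,
            "expected": expected,
            "actual": actual,
            "severity": "high" if high_risk else "low",
            # Only low-risk drift gets a drafted PR; everything else goes to a person.
            "next_step": "notify_human" if high_risk else "draft_pr_for_review",
        })
    return findings
```

In this shape the agent's contribution is the part after `find_drift`: explaining the finding and drafting the remediation PR for a human to review.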
Use case 04: Internal helpdesk Tier-0
What it does: The "how do I…" tickets. Where is the VPN client? What is the policy on personal devices? How do I request access to the marketing tools? The agent answers from your knowledge base (RAG) and, where the answer requires an action (provisioning, ticket creation), executes the action within the whitelist.
Why it works: Volume is high, decision space is narrow (answer from corpus or escalate), users self-correct (if the answer is wrong they say so), and the alternative (humans answering the same question 200 times) is the worst use of expensive engineering time.
Result at the FinTech scale-up: Tier-0 ticket volume into engineering dropped 64% in 90 days. Median resolution on resolved-by-agent tickets: 90 seconds. Escalation rate to human: 18% (against an initial target of 30%, improving as the corpus matured).
Use case 05: Onboarding orchestration
What it does: New hire event fires from the HRIS. The agent walks a checklist: provision the identity, assign group memberships based on role, ship the device with the right MDM profile, enroll in compliance training, schedule the welcome sessions. Where deviations from the standard are needed, the agent flags for human review.
Why it works: The happy path is well-defined and high-volume. Deviations are the exception, not the rule. Human review is reserved for the cases where it actually adds value (the unusual role, the edge-case location, the special access requirement).
Use case 06: Cross-system data lookups
What it does: A user in Slack asks "what's the status of order 47298?" — the agent calls the ERP API, the shipping API, the CRM, and assembles a coherent answer with citations to each source system. No data movement, no synchronisation pipeline, just a smart query layer.
Why it works: The systems already have the data. The user does not have time to log into three of them. The agent reads, summarises, and cites. The action surface is read-only, so the safety story is trivial.
Use case 07: Compliance evidence collection
What it does: Audit asks "show me evidence that we have MFA enabled on all admin accounts as of last quarter". The agent walks Entra ID / Okta / AWS IAM, collects the relevant configuration snapshots, formats per the auditor's template, and produces a defensible evidence pack.
Why it works: The data sources are stable APIs. The output format is repeatable. The work is dull, error-prone for humans, and high-stakes (a missed evidence item is a compliance finding). The agent never has to make a value judgement — it collects what it is told to collect and presents it.
Use case 08: Capacity planning advisor
What it does: The agent watches resource utilisation trends across the estate (cloud spend, license utilisation, storage growth, GPU usage) and surfaces recommendations: this team is over-provisioned, this SaaS contract is under-utilised at renewal, this storage tier is becoming expensive vs alternatives.
Why it works: The agent is an advisor, not an actor. It produces a structured recommendation. A human decides whether to act. The model is good at the underlying task (pattern recognition over time series with context). The user (a platform lead) is qualified to evaluate the recommendation.
Where agents reliably fail
The honest part. Four patterns where we have either tried and stopped, or watched clients try and stop.
Failure 01: Open-ended workflows with ambiguous goals
"Make my IT operations better." "Reduce our cloud costs." "Optimise our onboarding." These prompts feel like agent prompts but they are management goals, not tasks. The agent has no defined success criterion, no bounded action space, and no way to know when to stop. Production-wise, this means agents that ramble, take 50 steps to produce a mediocre answer, and burn cost without producing measurable output.
The pattern is to decompose. Take the management goal, break it into the 5-10 concrete tasks that would advance it, then deploy agents on the individual tasks. The agent is good at "classify this ticket"; it is bad at "improve our ticket handling".
Failure 02: Decisions that require judgement we cannot encode
"Should we approve this expense?" "Is this candidate a good fit?" "Should we extend this contract?" These decisions blend explicit criteria with tacit organisational knowledge that nobody has written down. The agent will produce a confident answer; the answer will be wrong often enough to be dangerous.
The pattern is to flip the framing. Instead of "agent decides", use "agent assembles the decision packet" — pulls the relevant policies, recent precedents, applicable data — and a human decides. The agent saves the human 80% of the gathering time; the human keeps the decision authority.
Failure 03: Actions on systems with unstable APIs
Some SaaS APIs are reliable contracts. Others change quietly, return inconsistent data, or have edge cases the documentation does not surface. An agent making decisions against an unstable API will produce inconsistent behaviour for reasons that look like model failure but are actually integration failure.
We have walked away from agent-mediated actions on two specific SaaS systems where the API behaviour was too inconsistent to build a reliable agent. Both cases were better handled by a human running a SaaS UI flow.
Failure 04: Multi-step actions where intermediate state matters
If the agent calls Tool A, then Tool B, then Tool C, and Tool B's output affects whether Tool C should be called, the agent's reasoning about that intermediate state has to be perfect. Current frontier models do not deliver "perfect"; they deliver reliability in the mid-90s percent.
For workflows where 96% reliability is sufficient (most IT operations), this is fine. For workflows where one failed sequence creates a downstream incident (provisioning workflows, financial transactions, security operations), 96% is unacceptable. The pattern is to constrain to 1-2 step actions and let a deterministic orchestrator handle the chaining.
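The arithmetic behind constraining chains is worth seeing once. The sketch below assumes the 96% figure applies to each step and that steps fail independently; both are simplifying assumptions for illustration, not measurements.

```python
def chain_success_rate(per_step_reliability: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step_reliability ** steps

if __name__ == "__main__":
    for steps in (1, 2, 3, 5):
        print(steps, round(chain_success_rate(0.96, steps), 3))
```

Under those assumptions a three-step chain succeeds end to end about 88% of the time and a five-step chain about 82%, which is why short agent actions chained by a deterministic orchestrator degrade far more slowly than one long agent-driven sequence.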
The safety architecture that separates production from prototype
Every agent we ship has five mandatory components. Skip any one and the system is not production-grade.
Component 01: The action whitelist
The list of operations the agent is allowed to execute. Explicit, named, version-controlled, signed off by a human owner per action. The whitelist is the absolute boundary: an action not on the list cannot be taken, period.
Implementation: tool definitions in the agent runtime, with each tool's invocation logged with the inputs, outputs, and decision rationale. Off-whitelist actions are not "blocked" — they are not available to the model in the first place.
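A minimal sketch of that idea follows. The registry layout, the `owner` field and the stub handler are hypothetical; the point is only that the model is offered names from the registry and nothing else, and that every invocation is logged with its rationale.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class WhitelistedAction:
    name: str
    owner: str                       # the human who signed off this action
    handler: Callable[[dict], str]

def reset_password(inputs: dict) -> str:
    """Stub handler; a real one would call the identity provider."""
    return f"password reset queued for {inputs['user']}"

WHITELIST = {a.name: a for a in [
    WhitelistedAction("reset_password", owner="it-ops-lead", handler=reset_password),
]}

def tools_offered_to_model() -> list[str]:
    """Only whitelisted names are ever shown to the model."""
    return sorted(WHITELIST)

def invoke(name: str, inputs: dict, rationale: str, log: list[dict]) -> str:
    """Run a whitelisted action and record inputs, output and rationale."""
    action = WHITELIST.get(name)
    if action is None:
        # Defence in depth: this should be unreachable if the model only sees
        # tools_offered_to_model().
        raise PermissionError(f"{name!r} is not a whitelisted action")
    output = action.handler(inputs)
    log.append({"action": name, "inputs": inputs, "output": output,
                "rationale": rationale, "owner": action.owner})
    return output
```

Keeping the registry in version control gives you the sign-off history for free: adding an action is a reviewed change, not a prompt edit.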
Component 02: Privilege threshold per action
Within the whitelist, each action has a threshold above which a human must approve. Reset a standard user's password? Auto. Reset an admin's password? Human approval. Grant access to a public folder? Auto. Grant access to anything tagged "Confidential"? Human approval.
The thresholds are explicit configuration, not buried in prompt text. The agent does not "decide" whether to escalate — it is told to escalate by the policy engine outside the model.
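As a sketch of what "explicit configuration" can look like, the thresholds below live in a plain data structure that a policy function reads. The action names, attribute names and values are hypothetical, and failing closed on anything unconfigured is our assumption about a sensible default.

```python
# Hypothetical threshold configuration: for each action, which attribute to
# inspect and which values force human approval.
THRESHOLDS = {
    "reset_password": {"attribute": "target_role", "requires_approval": {"admin"}},
    "grant_folder_access": {"attribute": "folder_tag", "requires_approval": {"Confidential"}},
}

def decide(action: str, context: dict) -> str:
    """Policy engine outside the model: returns 'auto' or 'human_approval'."""
    rule = THRESHOLDS.get(action)
    if rule is None:
        return "human_approval"      # no configured rule: fail closed
    value = context.get(rule["attribute"])
    if value is None or value in rule["requires_approval"]:
        return "human_approval"      # missing data is treated as above threshold
    return "auto"
```

Because the decision is a lookup rather than a model judgement, changing a threshold is a configuration change that can be reviewed and audited like any other.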
Component 03: Reversibility analysis
Every action is tagged as reversible or irreversible. Reversible actions can be auto-executed within threshold. Irreversible actions (delete, send-email, financial transaction) always require human approval, regardless of threshold.
Reversibility is not the same as "low impact". A "send an email" action is reversible in scope (you can email an apology) but irreversible in fact (the email was sent). We treat sending external email as irreversible by default; internal Teams / Slack messages as reversible.
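Reversibility tagging can be sketched the same way. The action names here are hypothetical, and treating an untagged action as irreversible is an assumption on our part, chosen so that a forgotten tag errs towards human approval.

```python
from enum import Enum

class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"

# Hypothetical tags following the defaults described above.
ACTION_REVERSIBILITY = {
    "send_internal_message": Reversibility.REVERSIBLE,
    "send_external_email": Reversibility.IRREVERSIBLE,
    "delete_record": Reversibility.IRREVERSIBLE,
}

def execution_mode(action: str, within_threshold: bool) -> str:
    """Irreversible actions need approval regardless of the privilege threshold."""
    tag = ACTION_REVERSIBILITY.get(action, Reversibility.IRREVERSIBLE)
    if tag is Reversibility.IRREVERSIBLE:
        return "human_approval"
    return "auto" if within_threshold else "human_approval"
```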
Component 04: Audit trail with full context
Every agent action produces an audit record: which user invoked the agent, what was the input, what tools were called, what was the model's reasoning between calls, what was the final output. Stored in Langfuse (or equivalent), retained per your compliance regime, queryable for incident response.
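The fields listed above translate into a small record type. This sketch serialises to plain JSON and deliberately does not use the Langfuse SDK; the class and field names are hypothetical, and a real deployment would map them onto whatever tracing store is in use.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class ToolCall:
    name: str
    inputs: dict
    output: str
    reasoning: str                   # the model's stated rationale before the call

@dataclass
class AuditRecord:
    invoked_by: str
    user_input: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_output: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialise the whole record, nested tool calls included."""
        return json.dumps(asdict(self), sort_keys=True)
```

Usage is one record per agent invocation: create it when the request arrives, append a `ToolCall` after each step, set `final_output` at the end, and write `to_json()` to the store.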
This is non-negotiable for regulated environments. DORA Article 28, NIS2 Article 21, GDPR Article 30 all expect this level of evidence. An agent that cannot produce this audit trail is not deployable in a regulated tenant.
Component 05: Refusal and escalation handling
The agent must have a documented path for "I do not know" and "this is outside my whitelist". Both paths produce a clean user response (not a hallucination) and route to a human queue. The escalation queue has SLAs and named owners.
The metric we watch is the escalation rate. A healthy agent has an escalation rate between 10% and 25% depending on use case. Below 10%, the agent is probably over-confident on edge cases. Above 25%, the whitelist is too narrow and the agent is not earning its cost.
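The band is simple enough to turn into a monitoring check. A minimal sketch, with the 10% and 25% bounds taken from the text above and the function name and return strings invented for illustration:

```python
def escalation_health(total: int, escalated: int,
                      low: float = 0.10, high: float = 0.25) -> str:
    """Classify an agent's escalation rate against the healthy band."""
    if total <= 0:
        raise ValueError("need at least one handled request")
    rate = escalated / total
    if rate < low:
        return "possibly over-confident"
    if rate > high:
        return "whitelist may be too narrow"
    return "healthy"
```

Since the healthy band depends on the use case, the bounds are parameters rather than constants.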
Cost modelling for agents specifically
Agent cost is dominated by the loop. Each step is a model call, so a 5-step agent costs at least 5x what a single classification call would cost, and often more, because each step typically re-sends the accumulated context. Two cost levers matter most:
Lever 01: Maximum step budget
Hard-cap the number of steps per agent invocation. Default 5. Tasks that consistently hit the cap are wrong-shaped for the agent — break them down or move them to a deterministic orchestrator.
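Spotting the wrong-shaped tasks is a matter of counting how often each task type hits the cap. In the sketch below the run log format, the task type names and the 50% cut-off for "consistently" are all hypothetical.

```python
from collections import defaultdict

MAX_STEPS = 5  # the default step budget

def cap_hit_rates(runs: list[tuple[str, int]]) -> dict[str, float]:
    """Fraction of runs per task type that used the full step budget."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for task_type, steps_used in runs:
        totals[task_type] += 1
        if steps_used >= MAX_STEPS:
            hits[task_type] += 1
    return {task: hits[task] / totals[task] for task in totals}

def wrong_shaped(runs: list[tuple[str, int]], threshold: float = 0.5) -> list[str]:
    """Task types that hit the cap at least `threshold` of the time."""
    return sorted(t for t, rate in cap_hit_rates(runs).items() if rate >= threshold)
```

Anything `wrong_shaped` returns is a candidate for decomposition or for a deterministic orchestrator.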
Lever 02: Model tier per step
The thinking steps (classification, routing) can run on Haiku-tier or smaller. Only the final answer step needs the frontier model. We typically split the work: a Haiku step decides the action; a Sonnet step produces the user-facing output. Cost reduction vs all-Sonnet: 60-80%.
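To see how a saving in that range can arise, here is a back-of-envelope sketch. It assumes every step consumes the same number of tokens and expresses the small model's price as a fraction of the large model's; the 0.2 ratio used in the example is a hypothetical figure, not a quoted price.

```python
def tiered_saving(steps: int, small_to_large_price_ratio: float) -> float:
    """Fractional saving when all but the last step run on the smaller model.

    Costs are in units of one large-model step; equal token use per step is assumed.
    """
    all_large = steps * 1.0
    tiered = (steps - 1) * small_to_large_price_ratio + 1.0
    return 1 - tiered / all_large

if __name__ == "__main__":
    # Five steps, small model at a hypothetical one fifth of the large model's price.
    print(round(tiered_saving(5, 0.2), 2))
```

Under these assumptions a five-step agent saves 64%, and the saving can never exceed 80% with five steps because the final step always runs on the larger model. Real savings depend on actual prices and on how tokens are distributed across steps.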
The build-or-buy decision
For each agentic use case, the decision tree is short:
- Buy a vertical agent product if your use case is well-served by an existing vendor (e.g., AI helpdesk products, IT-specific copilots). The vendor has done the safety work and the integration work. You pay a per-seat premium and lose customisation flexibility.
- Build with a managed-agent framework (Claude with tool use, Microsoft Copilot Studio, custom code with an orchestration library) if your use case requires deep integration with your internal systems and the safety surface matters. Engineering cost is meaningful; control is total.
- Use HOIST if you want the "build" outcome without the engineering cost. Our productised internal-operations agent is exactly this pattern, with the safety architecture above built in. HOIST Implementation is the bespoke variant; the HOIST product is the standardised SaaS.
The one-paragraph version
Agents work brilliantly inside narrow action spaces with clear success criteria, deterministic APIs, and explicit human-in-the-loop escalation. They fail visibly on open-ended workflows, judgement-heavy decisions, unstable APIs, and multi-step chains where intermediate state matters. The safety architecture (whitelist, privilege thresholds, reversibility, audit trail, refusal paths) is what makes the difference between a useful tool and an incident waiting to happen. Build the eight use cases above; avoid the four failures; ship the five-component safety stack; and the technology earns its keep.
If you want a quick scoping conversation on your specific operation, the discovery call is free and we will tell you which two of your top ten pain points are agent-shaped and which two are not. Most teams come in with the order reversed.