Everyone says they “do AI” now. It’s like driving. Most people can drive to the shops. Fewer can fix the car. Very few can keep a race car on the track when it gets twitchy at high speed.
Enterprise AI is the same, and executives are flooded with AI demos. This article strips the noise away. It explains agents in plain language, shows where they fit, and keeps your approvals, budgets, and ownership intact. You get a one‑page architecture, a scoreboard you can defend, eight real risks with fixes, and a 90‑day plan. The outcome is not an AI story. It is fewer clicks, shorter queues, stronger controls, and proof on the board.
Why Now?
Budgets are tight, and risk appetite is not. Agents can remove friction or create a mess. The difference is governance and design, not magic.
First Principles — The “Brain” Explained in Plain Language
The center is a text brain. Text in and text out. That is it.
No ledger inside. No hidden spreadsheet. Just a fast pattern engine that reads words and writes words.
So we build the rest around it:
- Context: a short briefing we hand to the brain so it knows the situation.
- Memory: enough recent history to make the next step make sense.
- Planner: turns a goal into steps, written as ordinary text.
- Tools: real functions in your systems. The brain can ask for a tool by name; your software does the work.
- Guardrails: limits, approvals, and checks. Keeps clever from becoming costly.
- Observability: what was asked, which context we supplied, which tools ran, what it cost, and what happened.
When people say “agentic AI,” they often mean that loop. The brain plans and explains. The software does the work. Most of the effort is still engineering, security, data, and integration. In other words, the same craft we already know.
What We Learned in Practice
- The first time we connected a tool menu, the names were too cute. We renamed them in plain language. Fewer mistakes.
- The memory window was too long. Answers rambled. We shortened it. Answers got sharper.
- A “helpful” auto-approval slipped into a test flow. We removed the code path and added a hard gate.
From Prompt to Action — How the Flow Really Works
- A user states an outcome
- The planner sketches steps
- We fetch background material (the brief)
- The brain proposes the next move in text
- If work is required, it asks for a tool
- Our system runs the tool and returns the result as text
- The loop continues until done
- If a step is sensitive, we stop and request approval. That is a normal control, not new magic
You can keep the approvals and maker‑checker rules you already trust. You simply hang them onto this loop. Nothing gets waived because “it’s AI”.
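For the engineers in the room, here is a minimal sketch of that loop in Python. The helper functions (propose, run_tool, ask_person), the tool names, and the step limit are illustrative assumptions, not any product's API.

```python
# Minimal sketch of the propose -> approve -> execute loop described above.
# propose, run_tool and ask_person are placeholders for your own planner,
# tool layer and approval workflow. Names and limits are illustrative.

MAX_STEPS = 10                                        # the planner only gets so many attempts
HIGH_RISK_TOOLS = {"post_journal", "send_payment"}    # example: these always pause for a person

def needs_approval(tool_name: str) -> bool:
    """Sensitive steps stop and wait. A normal control, not new magic."""
    return tool_name in HIGH_RISK_TOOLS

def run_agent(goal: str, brief: str, propose, run_tool, ask_person) -> list[dict]:
    manifest = []                                     # every round is recorded
    context = brief
    for step in range(MAX_STEPS):
        move = propose(goal, context)                 # the brain proposes the next move, in text
        if move["action"] == "done":
            break
        if needs_approval(move["tool"]) and not ask_person(move):
            manifest.append({"step": step, "tool": move["tool"], "status": "rejected"})
            break
        result = run_tool(move["tool"], move["args"]) # your software does the actual work
        manifest.append({"step": step, "tool": move["tool"], "result": result})
        context += f"\nResult of {move['tool']}: {result}"   # result returned to the brain as text
    return manifest
```

The point is not the code. It is that the approval gate and the step limit are ordinary software controls, visible in one place.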
What We Learned in Practice
- We added a simple rule: the agent proposes, people approve, systems execute. It reads old‑fashioned. It works.
- Every round writes a run manifest. Who asked, what context, what tools were used, what the cost was, and the outcome. The manifest saved us twice in the first month.
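A run manifest does not need to be elaborate. A minimal sketch, with field names that are our own convention rather than a standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative manifest record; the field names are an assumption, not a standard.
@dataclass
class RunManifest:
    requester: str                                        # who asked
    goal: str                                             # what they asked for
    context_sources: list[str]                            # which briefing material we supplied
    tool_calls: list[dict] = field(default_factory=list)  # which tools ran, with arguments
    cost_usd: float = 0.0                                 # what it cost
    outcome: str = "pending"                              # what happened
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```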
The Car vs the Race Car
Everyone can use a general assistant. That is driver mode. Useful for questions, not a product.
Mechanic mode is different. You build the system: the data, the tools, the logs, the guardrails. That is where reliability and return on investment show up.
Race driver mode is the next step. Performance tuning under pressure. Clear limits. Cost under control. Incidents handled at 2 a.m. without drama. Very few teams operate here.
Your job as sponsor: make sure we are building the car, not just admiring the dashboard.
Business Case — Same Rules, Higher Bar
Do not lose your head just because it is AI. The job is still to move a number that matters.
- Problem: explain it in plain language
- Target metric: cycle time, straight‑through rate, exception handling time, cost per case – pick one and baseline it
- Cash map: show the line from that metric to money or risk
- One scorecard: shared across business and tech. No split KPIs
- Cost discipline: real‑time usage, per‑use‑case budgets, and a visible “cost of answer”
- Definition of done: live, used, measured, and supported. Anything less is a demo
How We Measured the AI Initiative
We used a simple sheet: baseline on the left, target in the middle, actuals on the right. The owner reviewed and signed it weekly. Targets that were met were noted; targets that were missed got an explanation.
Governance That Actually Moves the Needle
Roles that matter (and what happens if they vanish):
- Sponsor: carries the mandate and shows up. When sponsors go missing, programs drift.
- Product owner: owns the number that will be moved this quarter. Makes decisions promptly.
- Domain owners: own the data and source systems. They approve tool scopes.
- Architecture and security: own the rails and exceptions. Fast track exists; free passes do not.
- Operations: own the run. If something breaks, they know what to roll back.
- Change lead: runs embedding as a design stream. Adoption is not an afterthought.
Cadence that keeps it honest:
- Weekly: adoption, value, exceptions, cost, incidents
- Fortnightly: tools, prompts, policies — decisions, not theatre
- Monthly: steering for approvals and funding. No surprises
Main Sponsor Concerns
- Do we keep approvals? Yes, for high‑risk moves
- Can it post to core systems? Not without a tool that we approve
- Who gets the 2 a.m. call? Operations, and they have a run book
- What happens if the brain writes nonsense? The system catches it and asks for help
- How do we stop cost spikes? Per‑run budgets and timeouts. We see cost on the same scoreboard as value
Data Readiness — Why the Experience Lives or Dies Here
We love to talk about “intelligence.” We skip the part that breaks it: messy data.
If customer names are in five formats, the assistant will ask the same question twice. If statuses are a mix of codes and free text, the agent will flag the wrong case. If two systems use the same word for different things, your explanation will sound confident and still be wrong.
Practical fixes:
- Decide what an ID is. One way. Everywhere. No exceptions.
- Normalize names and addresses upstream. Do not throw that at the agent.
- Make states and reasons a clean list, not “type whatever you want.”
- Build one small dictionary per domain. Translate old terms to new ones at the edge.
- Keep a freshness flag on source data. If the brief is stale, say so in the output.
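To make the dictionary-at-the-edge and one-format ideas above concrete, here is a small sketch. The legacy codes, the canonical status list, and the date format are examples, not a prescription.

```python
from datetime import datetime

# Tiny per-domain dictionary: translate old terms to new ones at the edge.
# The legacy codes and canonical states below are invented for illustration.
STATUS_DICTIONARY = {"CLSD": "closed", "Closed - cust req": "closed", "OPN": "open"}
CANONICAL_STATES = {"open", "closed", "pending"}

def normalize_status(raw: str) -> str:
    status = STATUS_DICTIONARY.get(raw.strip(), raw.strip().lower())
    if status not in CANONICAL_STATES:          # clean list, not "type whatever you want"
        raise ValueError(f"Unknown status: {raw!r}")
    return status

def normalize_date(raw: str) -> str:
    # One date format, everywhere. Reject everything else instead of guessing.
    return datetime.strptime(raw, "%Y-%m-%d").date().isoformat()
```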
What We Learned in Practice
- We killed five minutes of every call by reusing the same three data points. Simple reuse. Big lift in trust.
- We cut a whole class of errors by enforcing a single date format and rejecting everything else.
A Simple Reference Architecture (Board‑Ready)
- Orchestrator: manages the loop, sets limits per run, and records a manifest.
- Context services: fetch and prepare background material; test them as you would any data pipeline.
- Tool menu: only the functions the agent may call. Narrow scopes. Rate limits.
- Identity and roles: the agent runs with a specific identity. If it needs more authority, it asks, and policy decides.
- Approvals: high‑risk actions stop and wait for a person.
- Observability: logs for prompts, context, tools, latency, cost, and outcome.
- Versioning: prompts, policies, tools — all versioned like code.
- Tenancy: data and memory stay inside the right boundaries.
Understandable. Auditable. On one page.
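For the tool menu in particular, a short sketch shows how narrow it can be. The tool names, scopes, and rate limits are invented for illustration; your own service layer would enforce them.

```python
# Sketch of a tool menu: the only functions the agent may call.
# Names, scopes, limits and approval flags are invented for illustration.
TOOL_MENU = {
    "get_customer_address": {
        "scope": "customer:read", "rate_per_minute": 30, "needs_approval": False
    },
    "update_customer_address": {
        "scope": "customer:write", "rate_per_minute": 5, "needs_approval": True
    },
}

def call_tool(name: str, args: dict, granted_scopes: set[str]):
    entry = TOOL_MENU.get(name)
    if entry is None:
        raise PermissionError(f"{name} is not on the menu")          # fail closed
    if entry["scope"] not in granted_scopes:
        raise PermissionError(f"{name} requires scope {entry['scope']}")
    # Rate limiting, the approval pause and manifest logging would sit here,
    # before the real system call is made by your own code.
    ...
```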
Make the Risks Real — Eight You Can Recognize Today
- Hidden Instructions in Content
A policy file included a “footnote” instructing readers to send data to an email address. If you pass that straight through, the brain might try to send an email.
Fix: strip or sandbox untrusted inputs. The agent never sends an email directly. High‑risk actions require a person.
- Over‑Broad Tools
A customer‑address tool also allowed closing an account because it was convenient years ago.
Fix: one tool, one purpose. Narrow permissions. Rate limits. Peer review on scopes. Audit every call.
- Quiet Data Drift
A confident answer used last month's policy. It looked fine but was wrong.
Fix: evaluation of the retrieval step, freshness checks, and visible citations so reviewers can see the source.
- Budget Loops
A plan kept retrying a failing step, burning tokens and time.
Fix: per‑run budgets, timeouts, and loop limits. Cost shows up in the log where everyone can see it.
- Version Drift
A prompt tweak shipped without tests. Accuracy fell.
Fix: treat prompts like code. Version. Review. Roll back and link incidents to exact prompt versions.
- Leaky Memory
Old tenant data surfaced in a new run.
Fix: hard tenant boundaries, encryption at rest, short memory windows, and deletion tests with live data.
- Tool Poisoning
A shared function was modified upstream and started doing more than it said.
Fix: lock tool contracts, scan for changes, and fail closed if a signature does not match.
- Ownership Gaps
No one owned incident response.
Fix: name the owner. Write the run book. Practice the drill.
None of these fixes are exotic. They are the same disciplines in a new loop.
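One of them is easy to picture in code: the tool poisoning fix, fail closed if a signature does not match. A minimal sketch, assuming tool contracts can be serialized to JSON and that approved hashes are pinned at review time.

```python
import hashlib
import json

# Pin a hash of each tool contract at review time; refuse to run if it changes.
# The tool name and placeholder hash are illustrative.
APPROVED_CONTRACTS = {"get_customer_address": "a3f1..."}   # filled in at peer review

def verify_tool_contract(name: str, contract: dict) -> None:
    digest = hashlib.sha256(
        json.dumps(contract, sort_keys=True).encode()
    ).hexdigest()
    if APPROVED_CONTRACTS.get(name) != digest:
        raise RuntimeError(f"Tool {name} changed since review; failing closed")
```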
Cost and Capacity — How We Kept the Bill Sane
People expect model prices to fall forever. Then they watch an agent chain ten tool calls, retry twice, and send the bill north.
Practical levers that worked:
- Pick the smallest brain that passes your tests. Use the big one for the hard step only
- Cache safe repeats for a short time. Not secrets. Just common look‑ups
- Put a budget on every run. If it hits the ceiling, it stops and asks for help
- Kill loops fast. The planner only gets so many attempts
- Expose “cost of answer” on a simple page. Name it. Fix it
- Run reports by use case and team, not just by month. That is where you see patterns you can change
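Two of those levers, the per-run budget and the short-lived cache, fit in a few lines. A sketch; the numbers are examples, not recommendations.

```python
import time

# Per-run budget: if a run hits the ceiling, it stops and asks for help.
class RunBudget:
    def __init__(self, max_usd: float = 0.50, max_seconds: int = 120):
        self.max_usd, self.max_seconds = max_usd, max_seconds
        self.spent_usd, self.started = 0.0, time.monotonic()

    def charge(self, usd: float) -> None:
        self.spent_usd += usd
        if self.spent_usd > self.max_usd or time.monotonic() - self.started > self.max_seconds:
            raise RuntimeError("Budget hit: stopping and asking for help")

# Short-lived cache for safe repeats (common look-ups only, never secrets).
_CACHE: dict[str, tuple[float, object]] = {}

def cached_lookup(key: str, fetch, ttl_seconds: int = 300):
    now = time.monotonic()
    hit = _CACHE.get(key)
    if hit and now - hit[0] < ttl_seconds:
        return hit[1]
    value = fetch()
    _CACHE[key] = (now, value)
    return value
```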
What We Learned in Practice
- We cut costs by a third by splitting one big plan into two smaller ones with a hand-off.
- A 5‑minute caching rule removed a painful spike in a busy hour. No one noticed, except finance.
Evaluation — How We Stay Honest
We do not test once and move on. We test all the time.
- Golden prompts: a small set that must pass before any changes ship.
- Golden docs: a small pack of source files we know well. The agent must cite the right parts.
- Domain checks: a handful of edge cases where we failed before. We keep them close.
- Human review: short and fast. Two pairs of eyes for risky flows.
- Post‑deploy watch: the first 48 hours after a change are always noisy. We staff for it.
- Stop rules: if accuracy or cost moves in the wrong direction, we stop, roll back, and review the logs.
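The golden prompt check can be as plain as a small script that must pass before any change ships. A sketch, where the cases, the expected phrases, and the ask_agent callable are placeholders for your own suite.

```python
# Golden prompts: a small set that must pass before any change ships.
# The cases and the ask_agent callable are placeholders for your own suite.
GOLDEN_CASES = [
    {"prompt": "What is our refund window?",
     "must_contain": ["30 days"], "must_cite": "refund_policy.pdf"},
    {"prompt": "Summarize case 1042",
     "must_contain": ["open", "awaiting documents"], "must_cite": "case_1042"},
]

def check_golden_prompts(ask_agent) -> None:
    failures = []
    for case in GOLDEN_CASES:
        answer, citations = ask_agent(case["prompt"])   # returns (text, list of sources)
        if not all(p.lower() in answer.lower() for p in case["must_contain"]):
            failures.append(case["prompt"])
        if case["must_cite"] not in citations:
            failures.append(case["prompt"] + " (missing citation)")
    assert not failures, f"Golden prompt failures: {failures}"
```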
Incident and Change — What Happens at 2 A.M.
Something will go wrong. Plan for it.
- One number to call
- First move: stop risky tools
- Second move: roll back the last prompt or tool change
- Third move: replay the failing run in a sandbox and read the manifest
- After action: write it down. Fix the root cause. Add a test
Basic discipline saves nights.
MCP Adapters vs Custom Integrations — A Balanced Take
There is a growing pattern: a small server that exposes tools to assistants in a standard way. Think of it as an adapter. It can speed up discovery and reuse. The standard that has emerged here is MCP (Model Context Protocol).
Good for: low‑risk reads, simple writes, quick prototypes, and central reviews.
Less good for: anything involving money, identity, or sensitive data. Defaults are not your policy.
Practical Rule:
- Low‑risk, generic actions — use an adapter / MCP server; it is tidy and fast.
- High‑risk or deeply specific actions — keep your own service layer with your scopes, your logs, your approvals.
- Hybrid works well: adapter for read‑only, custom for act‑with‑approval.
Either way, every tool is a new trust boundary. Treat it as such.
Case Vignette 1 — Onboarding that Stopped Tripping Over Itself
We had a simple goal on paper: shorten onboarding. It was messy in practice. Every unit asked the same questions in a slightly different way. People sent the same document twice. Names were typed three times. You know the story.
We started by drawing the experience from the customer’s side. Then we made the data follow that. One ID format. One source of truth for names. A tiny dictionary to translate old field names into the new ones. Boring work.
Now, with that foundation in place, we can add agents into the process to handle specific steps. This reduces people’s effort but does not remove people entirely — they are still there for escalations, quality control, and monitoring. It is still the traditional process, traditional solutions, traditional controls. The difference is that some steps are simply better suited to agents. The secret is to give agents very small, tightly bound tasks, not the entire process.
What We Learned
I remember a smaller project we delivered years ago. The challenge was a complex multi‑step process in which, out of 100+ team members, only 2-3 had the experience to perform each step. They had been there for years and had learned the nuances and exceptions by working across each stage. Each step had its own “training course” and felt completely different from the next.
Staff churn was high. Average tenure was 3-6 months.
We solved it by creating a production line — the same way Henry Ford built cars. New team members started with no knowledge, but we quickly trained them on a single specific step. Workflow systems made this possible.
That is the picture for agents as well. Treat each agent like a station on the line. Finely tune it for one task, and it will do that task exceptionally well. Try to give an agent 10-20 different tasks, and you will have problems.
Case Vignette 2 — Reconciliation Without Drama
On paper, reconciliation sounds simple. In reality, it is late nights, mismatched fields, timing differences, and a pile of exceptions no one enjoys.
If I were introducing agents here, I would not start with “close the books faster.” I would start with one narrow slice: help the team clear noise so humans can focus on the few items that actually need judgment.
We would keep the existing process and controls. Same makers and checkers. Same approvals. The change is that a few steps get a tireless helper.
Where Agents Fit Best
- Read the ticket, gather the context, and summarize it in a single short note
- Pull yesterday’s statement and the ledger extract for the same window
- Line up columns the same way every time — IDs, amounts, dates, currency, reference
- Do a first pass match on safe rules (exact match, known offsets, same‑day reversals)
- Tag timing differences and duplicates. Propose a reason if it is obvious
- Draft the follow‑up note or journal template for review, but never post it
- If any threshold is crossed, pause and ask for approval
That is it. No heroics. Just fewer clicks and clearer piles.
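The first-pass match on safe rules is deliberately dull. A sketch, with invented field names and only two rules: exact match and a next-day timing difference.

```python
from datetime import timedelta

# First-pass match on safe rules only; everything else stays an exception for a person.
# Record fields (ref, amount, value_date as a date object) are illustrative,
# not a specific ledger or statement format.
def first_pass_match(statement: list[dict], ledger: list[dict]):
    ledger_index = {(l["ref"], l["amount"], l["value_date"]): l for l in ledger}
    matched, exceptions = [], []
    for line in statement:
        exact = (line["ref"], line["amount"], line["value_date"])
        next_day = (line["ref"], line["amount"], line["value_date"] + timedelta(days=1))
        if exact in ledger_index:
            matched.append((line, ledger_index[exact], "exact match"))
        elif next_day in ledger_index:
            matched.append((line, ledger_index[next_day], "timing difference (next-day posting)"))
        else:
            exceptions.append(line)     # the agent proposes nothing here; humans decide
    return matched, exceptions
```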
Controls That Stay in Place
- The agent cannot move money
- The agent cannot post journals
- Anything above a set threshold needs a person
- Every run writes a manifest: who asked, what data was used, what rules applied, what it cost, what changed
- Tool scopes are narrow. One tool, one job. If we need a second job, we make a second tool
What We Would Measure
- Percentage of items auto‑classified as “no action” with reviewer sign‑off
- Time to first decision on an exception
- Straight‑through rate for safe matches
- Rework rate after review
- “Cost of answer” per reconciliation case
If the numbers do not move, we stop and fix the data or the rules before adding anything new.
Rollout Plan (Lightweight)
- Week 1: pick one source system and one counterparty. Name the owner. Baseline the current time per case.
- Week 2: wire up read‑only pulls and build two or three safe matching rules.
- Week 3: draft‑only mode. The agent prepares matches and notes. Humans review and send back corrections.
- Week 4: add two more rules and a simple threshold. Anything above the line pauses for approval.
- Week 5: expand to a second counterparty. Keep the same rules.
- Week 6: review the scoreboard. Keep what helps and drop what does not.
Pitfalls to Avoid
- Letting one convenient tool do too much. Split it.
- Mixing reconciliation rules with free text inside prompts. Keep rules outside and test them.
- Hiding the source. Always show the link or the extract line so a reviewer can check it in two clicks.
- Treating timing differences as errors. They are not. Tag them and move on.
If We Wanted to Go Further Later
- Add a small catalog of known patterns: weekend posting delays, currency cutoffs, batch timing, and common duplicates.
- Teach the agent to propose a journal line with the right cost center and reason code — still draft‑only.
- Add a daily “top 10 exceptions” note to help leaders see drift early.
This is not about automating judgment. It is about clearing the mud so people can see the real rocks. The team keeps control. The numbers tell us if it is working. And the work gets less painful, one safe step at a time.
Team Shape — Build the Pit Crew
You do not need a stadium of specialists. You need a small crew that can ship.
- A product owner who can say no
- A business lead who owns the process and the number
- A prompt and context designer who speaks both business and tech
- A software engineer who builds the tool menu and guardrails
- A data lead who fixes the inputs that keep breaking the experience
- A change lead who gets adoption over the line
Hire for fit and calm under pressure. Teach the rest.
A 90‑Day Path That Respects Risk and ROI
Weeks 1‑2: Pick two use cases
- One “answer” case grounded in your documents with citations
- One “action” case with a few safe tools
- Baseline the metric. Map to cash
Weeks 3‑4: Build the spine
- Identity, scopes, approvals
- Run‑manifest logging
- Per‑use‑case budgets
- A small evaluation suite
Weeks 5‑8: Ship a supervised pilot
- Show sources
- Require approvals on risky steps
- Measure adoption and the target metric
Weeks 9‑12: Harden and expand
- Add an adapter for low‑risk reads if useful
- Extend evaluation with edge cases
- Take it through change control and hand it to operations with a named owner
Rollout Playbook — The First Ten Days After Go‑Live
- Day 1: a short stand‑up with users. Remind them what to try first
- Day 2: ride‑alongs. Sit with users. Fix papercuts on the spot
- Day 3: publish the first tiny metric. Do not wait for perfection
- Day 4: patch prompts that confuse people. Small edits
- Day 5: check cost. Kill loops
- Day 6: sit with risk and audit. Walk them through one full run, end to end
- Day 7: celebrate one saved hour. People remember that
- Day 8: add one safer tool
- Day 9: publish the second tiny metric
- Day 10: hand back to the team. Stay close for another week. Then let them run
Sponsor Checklist (Bring This to Steering)
- Owner named: who signs for value
- Metric & baseline: what number drops (or rises), by when
- Budget guard: per‑run limit and monthly ceiling in place
- Tool menu: list of allowed actions, each with a narrow scope
- Approvals mapped: which steps pause for a person
- Run manifest: can we reconstruct a run in 2 minutes
- Data freshness: how we stop stale briefs
- Stop rule: what triggers rollback. Who does it
- Adoption plan: who trains whom, and how we will know they are using it
- Hand‑off: which team owns it on Day 11
Closing Thought
Most people can ride a bike. Only a few drop into a steep line at speed and stay upright. Agentic AI is the same. Start small. Pick a case that matters. Tune the engine. Build the pit crew. Learn the corners. Then add pace.
Do that, and you will not just have an “AI story.” You will have work that moves faster, with stronger controls, and a scoreboard that proves it.