Anthropic's Managed Agents on platform.claude.com is the production runtime for agents: memory, tool execution, code interpreter, file handling, MCP wiring, and observability, all operated for you. The pitch: trade some architectural control for ~80% of the time-to-production. This page is for the CTO or VP Eng deciding whether that trade is right for your team, your stack, and your roadmap. It covers the decision matrix, the seven-part architecture, three real reference builds, and the production concerns most teams discover three months in.
Verify names and syntax against platform.claude.com/docs at build time. The architectural decisions in this page are stable; the syntax may not be.
The most expensive engineering mistake in 2026 isn't picking the wrong agent framework — it's picking the wrong layer of abstraction for what you're trying to build. Four real options. They serve different problems. Here's the honest comparison.
Anthropic now offers four meaningfully different products that solve overlapping but distinct problems. The naming makes them sound like a hierarchy; they're actually a portfolio. Choose by use case, not by ambition.
| Layer | What it is | Best for | Watch out for |
|---|---|---|---|
| Claude (chat), claude.ai | The end-user product: humans typing into a chat interface, with Projects, Skills, integrations. | Internal team use, Claude Projects for SMB workflows, the small-business cookbook pattern. (Label: non-engineers) | Not a developer surface. You can't programmatically invoke it; you can't ship it inside your product. |
| Claude Code, CLI / IDE | Anthropic's coding agent in the terminal. Built for engineers working on a codebase, locally, with hooks, slash commands, MCP, sub-agents. | Internal developer productivity. Building and maintaining your team's codebase. Internal automations engineers run themselves. (Label: eng leverage) | Wrong tool for production agents your users interact with. Not deployable as a service. |
| Claude API, api.anthropic.com | Raw model inference. You bring everything else: memory, tools, persistence, evals, observability, retries, rate limiting. | Single-call use cases (classification, extraction, summarization). Bespoke architectures where you need full control. (Label: infra-heavy) | You're rebuilding the agent runtime from scratch. Time-to-production: months, not weeks. Maintenance forever. |
| Managed Agents, platform.claude.com | Anthropic-operated agent runtime. You define the system prompt, tools, memory model, MCP servers, guardrails. They run it, persist it, scale it, observe it. | Production agents your users interact with. In-product chat, support, recommendation, document Q&A. Internal agents replacing tier-1 ops work. (Label: production-ready) | Less architectural control than rolling your own. Pricing model can surprise you at scale. Vendor concentration risk if Anthropic changes the product. |
If you're shipping an agent your customers or employees interact with and you don't have a strategic reason to own the agent runtime — Managed Agents. If you're doing single-shot inference at scale (classification, extraction, content generation) — API directly. If you're making your engineering team faster on your codebase — Claude Code. If you're enabling non-engineers in your business to use AI on documents and workflows — Claude (chat) with Projects. Most companies end up using three of the four. The question for any given build is which one this build belongs to.
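The rule of thumb above can be written down as a small function. This is a sketch for discussion, not an official selector: the `BuildProfile` fields and the order of the checks are assumptions made for illustration.

```typescript
// Hypothetical encoding of the layer decision; field names are illustrative.
type Layer = "Managed Agents" | "Claude API" | "Claude Code" | "Claude (chat)";

interface BuildProfile {
  audience: "customers_or_employees" | "engineers_on_codebase" | "non_engineers";
  singleShotInference: boolean; // classification, extraction, content generation
  strategicNeedToOwnRuntime: boolean; // a bespoke architecture is required
}

function chooseLayer(build: BuildProfile): Layer {
  // One model call is an API call, not an agent.
  if (build.singleShotInference) return "Claude API";
  if (build.audience === "engineers_on_codebase") return "Claude Code";
  if (build.audience === "non_engineers") return "Claude (chat)";
  // An agent that customers or employees interact with.
  return build.strategicNeedToOwnRuntime ? "Claude API" : "Managed Agents";
}
```

Running every item on the roadmap through a function like this is a quick way to see that most companies land on more than one layer.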
Five honest reasons not to choose Managed Agents even when it looks like the right fit:
The honest read. Managed Agents is correct for the majority of small-to-mid-sized companies shipping agentic features in 2026. Not because the alternatives are bad, but because most engineering teams underestimate by 5–10x what it costs to operate a production agent: memory, persistence, retries, observability, evals, prompt versioning, cost monitoring, abuse detection, audit logs. Anthropic is operating that stack at scale. Reaching for it earlier than you think you should is usually the right call.
A Managed Agent is not a model call. It's a system with seven components — each of which has architectural decisions that determine whether your build works in production or only in demos. Get these seven right and the agent will hold up. Get any of them wrong and you'll be debugging the wrong part of the stack for a quarter.
The agent's job, voice, tone, constraints, and refusal behavior. Iterate this with evals (component 07) — not vibes.
This is where most agent quality lives. A great system prompt with mediocre tools beats a mediocre prompt with great tools, every time.
Three scopes that matter: session (this conversation), user (this human across sessions), organization (everything the agent learned about this customer or tenant).
The wrong scope is the most common architecture mistake. Putting org-level facts in session memory means the agent forgets your business. Putting session-level state in org memory means it leaks across conversations.
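One way to make the scope decision explicit is to classify each kind of fact before it is stored. The sketch below is illustrative only: the `MemoryFact` shape and the two questions it asks are assumptions, not a platform API.

```typescript
// Hypothetical scope selector; the two boolean questions are illustrative.
type MemoryScope = "session" | "user" | "org";

interface MemoryFact {
  description: string;
  outlivesConversation: boolean; // false for in-flight workflow state
  aboutTenantAsAWhole: boolean; // true for business-level knowledge
}

function scopeFor(fact: MemoryFact): MemoryScope {
  // State that only matters to this conversation must not leak into others.
  if (!fact.outlivesConversation) return "session";
  // Long-lived facts go to the org only when they describe the whole tenant.
  return fact.aboutTenantAsAWhole ? "org" : "user";
}
```

For example, "customer X was promised a payment plan" would come out as `org`, while "currently comparing document A against document B" would come out as `session`.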
Three layers: native platform tools (web search, code, files), custom function-calling tools you define inline, and MCP servers you connect (your DB, your API, your CRM).
The tools an agent has access to are the surface area of what it can do. They're also the surface area of what it can do wrong. Every tool is a permission boundary that needs to be designed deliberately.
Sandboxed Python (and increasingly JS) runtime the agent can use to compute, transform data, render charts, or run validation logic. Underrated capability.
The "compute" leg of the stool. A document-Q&A agent without code exec can answer questions; one with code exec can also do math on the document, generate the chart, and validate its own answer.
How files enter the agent (user uploads, integration pulls, generated artifacts) and how they leave (downloads, side effects to your system, returned in responses).
The most under-thought-about part of agent builds. File flow is where data leakage, large-context cost blowups, and unexpected processing failures happen.
Input filtering (block PII / prompt injection / abuse), output filtering (block hallucinations on critical facts, redact sensitive data), and refusal behavior (when the agent declines and what it says).
Not a layer you bolt on at the end. Guardrails are part of the agent's identity. Decide them while writing the system prompt.
Tracing every turn (system prompt + memory state + tools invoked + tokens + cost + latency), running evals on every prompt change, alerting on drift.
The discipline that separates demos from production. If you can't replay a bad conversation and explain what happened, you don't have a production agent.
Everything beyond the runtime: how your product calls the agent, how it receives streamed events, how it surfaces tool approvals to users, how it stores conversation references in your DB.
Managed Agents operates the agent runtime; you still own the integration surface in your product. Plan for it.
Conceptually, defining a Managed Agent looks roughly like this. Names, parameter shapes, and event types will vary — verify against current docs — but the components are stable:

    import { Anthropic } from "@anthropic-ai/sdk";

    const client = new Anthropic();

    const agent = await client.agents.create({
      name: "collections-agent-v3",

      // 01: System prompt & persona
      system: `You are an AR collections specialist for Acme Corp.
    Your job: chase past-due invoices firmly but warmly.
    Voice: direct, no corporate hedging, never threaten.
    Always cite the specific invoice number and amount.
    Escalate to a human for accounts > 90 days or > $25k.`,

      // 02: Memory scopes
      memory: {
        session: { enabled: true, ttl_hours: 24 },
        user: { enabled: true, scope: "customer_id" },
        org: { enabled: true, scope: "acme_corp" }
      },

      // 03: Tools & MCP
      tools: [
        { type: "native", name: "web_search" },
        { type: "function", definition: getInvoiceByIdSchema },
        { type: "function", definition: draftChaseEmailSchema },
        { type: "function", definition: markEscalatedSchema, requires_approval: true } // write tool, gated
      ],
      mcp_servers: [
        { url: "mcp://quickbooks.acme.internal", scopes: ["read:ar"] }
      ],

      // 04: Code execution
      code_execution: {
        enabled: true,
        runtime: "python-3.12",
        libraries: ["pandas", "numpy"],
        max_seconds: 30
      },

      // 05: File handling
      files: {
        accept_uploads: ["application/pdf", "text/csv"],
        max_file_size_mb: 25,
        retention_days: 30
      },

      // 06: Guardrails
      guardrails: {
        input: { block_pii: true, prompt_injection_filter: "strict" },
        output: { block_hallucinated_amounts: true, redact_ssn: true },
        refusal: { tone: "professional", escalation_contact: "ar-lead@acme.co" }
      },

      // 07: Observability
      observability: {
        trace: "verbose",
        cost_alert_per_session_usd: 0.50,
        latency_alert_p99_ms: 8000,
        eval_set: "collections_eval_v3"
      }
    });

Read this code as architecture, not as syntax. The specific field names, nesting, and SDK methods will differ from what platform.claude.com publishes today. The seven concepts — system prompt, memory scopes, tool layers, code execution, file handling, guardrails, observability — are what every Managed Agent build configures, regardless of how the API surfaces them this quarter.
Three builds, ordered from lowest to highest brand risk. Each shows the architectural decisions for a real production agent — not toy examples. Internal first (because tier-1 ops replacement is the highest-ROI starter project), customer-facing second (the bulk of the long-term value), frontier third (the build that pays for the consultant who helps you avoid the mistakes).
Reads QuickBooks AR aging daily, drafts chase emails ranked by likelihood-to-pay × dollar value × days late, posts the queue to your bookkeeper for one-click approve-and-send. Handles soft-reply triage. Escalates the hard accounts to a human with full context. Replaces ~30% of a half-time bookkeeper's week. Brand risk: low — emails go through human approval before sending.
User-scoped to the bookkeeper running it; org-scoped on customer payment history (which arguments worked, what tone resonates with which AR contact, what's been promised).
MCP server: QuickBooks (read). Functions: draft_chase_email, get_customer_history, mark_for_human_review (gated). Native: code exec for AR-aging math.
Never threaten. Never quote a dollar amount the agent can't trace to a real invoice. Always cite invoice number. Escalate > 90 days or > $25k. Block any send where the agent's own confidence-on-draft score is below 7/10.
Slack — the bookkeeper opens a DM with the agent every morning at 8am, runs /aging, gets the ranked queue, approves and sends with reactions. Invocation is async-streaming; results stream back in 10–20 seconds.
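The guardrails for this build are concrete enough to check in code before a draft ever reaches the approval queue. The following is a minimal sketch under stated assumptions: the `ArInvoice` and `ChaseDraft` shapes, the regex used to find dollar amounts, and the verdict names are all hypothetical, and a real build would enforce the same rules wherever the platform lets you hook output checks.

```typescript
// Hypothetical pre-send check for a chase-email draft. All names are illustrative.
interface ArInvoice {
  number: string;
  amountUsd: number;
  daysPastDue: number;
}

interface ChaseDraft {
  body: string;
  invoice: ArInvoice;
  confidence: number; // the agent's own 0-10 score for this draft
}

interface DraftVerdict {
  action: "queue_for_approval" | "escalate" | "block";
  reason: string;
}

// Pull every "$1,234.56"-style amount out of the draft text.
function extractDollarAmounts(text: string): number[] {
  const matches = text.match(/\$\d[\d,]*(?:\.\d+)?/g) || [];
  return matches.map((m) => parseFloat(m.replace(/[$,]/g, "")));
}

function checkDraft(draft: ChaseDraft): DraftVerdict {
  const invoice = draft.invoice;
  if (invoice.daysPastDue > 90 || invoice.amountUsd > 25000) {
    return { action: "escalate", reason: "over 90 days or over $25k" };
  }
  if (draft.body.indexOf(invoice.number) === -1) {
    return { action: "block", reason: "invoice number not cited" };
  }
  // Every amount quoted must match the invoice the draft is about.
  const untraceable = extractDollarAmounts(draft.body).filter(
    (amount) => Math.abs(amount - invoice.amountUsd) > 0.005
  );
  if (untraceable.length > 0) {
    return { action: "block", reason: "untraceable dollar amount" };
  }
  if (draft.confidence < 7) {
    return { action: "block", reason: "confidence below 7/10" };
  }
  return { action: "queue_for_approval", reason: "ok" };
}
```

Putting the escalation check first is a design choice: an account that belongs with a human should never be blocked silently for a formatting reason.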
Why this is the right starter build. Low brand risk (emails reviewed), high measurable ROI (days-to-collect on AR drops 15–25% in the first quarter), narrow scope (one integration, one user, one job). The agent has fewer than a dozen tools and the eval set fits in a single spreadsheet. If you can't ship this in 4 weeks of engineering effort, you have a process problem, not a Managed Agents problem.

    // Slack slash-command handler: /aging
    export async function handleAgingCommand(req: SlackRequest) {
      const session = await client.agents.sessions.create({
        agent_id: "collections-agent-v3",
        user_id: req.user_id,
        metadata: { trigger: "slash_aging", channel: req.channel_id }
      });

      const stream = await client.agents.sessions.invoke({
        session_id: session.id,
        message: "Pull today's AR aging. Rank top 10 chase candidates.",
        stream: true
      });

      for await (const event of stream) {
        switch (event.type) {
          case "tool_call_started":
            postEphemeral(req, `🔍 ${event.tool_name}…`);
            break;
          case "tool_call_completed":
            // stream progress to the user
            break;
          case "approval_required":
            postApprovalBlock(req, event.proposed_action);
            break;
          case "final_response":
            postRichResponse(req, event.content);
            break;
        }
      }
    }

What the bookkeeper sees: a ranked list of 10 AR chases, each with the suggested email draft, the customer's payment-pattern history, and Approve / Edit / Skip buttons. They process the queue in 12 minutes; the previous manual version took 90.
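The ranking behind that queue is the product described earlier: likelihood-to-pay × dollar value × days late. A minimal sketch follows, assuming a hypothetical `ChaseCandidate` shape and assuming `likelihoodToPay` is a 0 to 1 estimate produced elsewhere (how it is estimated is out of scope here).

```typescript
// Hypothetical ranking of chase candidates; field names are illustrative.
interface ChaseCandidate {
  invoiceNumber: string;
  likelihoodToPay: number; // 0..1, estimated elsewhere
  amountUsd: number;
  daysLate: number;
}

function chaseScore(c: ChaseCandidate): number {
  return c.likelihoodToPay * c.amountUsd * c.daysLate;
}

// Highest score first; the input array is left untouched.
function rankChases(candidates: ChaseCandidate[], topN: number): ChaseCandidate[] {
  return candidates
    .slice()
    .sort((a, b) => chaseScore(b) - chaseScore(a))
    .slice(0, topN);
}
```

A plain product means a large invoice with a low chance of payment can still outrank a small, near-certain one; whether that is the right trade-off is a business decision worth revisiting once real collection data exists.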
Embedded in your SaaS product. Customers upload their documents (contracts, reports, datasets) and ask questions. Agent answers, cites, computes, and produces artifacts (summary, redline, chart). Memory is user-scoped so the agent remembers what they've uploaded and what they've asked. Brand risk: medium — agent voice is your product's voice; wrong answers are your product's mistakes.
User-scoped for uploaded documents and conversation history. Session-scoped for the in-flight workflow state (currently analyzing X, comparing against Y). No org-scoping — strict tenant isolation is the architecture.
Native: file read, code execution (pandas, matplotlib), web search (gated by user setting). Functions: extract_clauses, compare_documents, generate_redline, render_chart.
Never claim a fact not present in the user's documents. Always cite the page/section. Refuse questions outside the scope of the uploaded materials (no general-knowledge Q&A drift). Redact any PII the user uploads from the response surface.
A chat sidebar inside your SaaS app. Invocation is async-streaming over WebSocket. Tool approvals (e.g., "send this redline as a comment to my collaborator") surface as inline UI affordances, not modals.
Why this is the build that defines your product's AI reputation. Internal agents have a forgiving audience. Customer-facing agents do not. The cost of one viral screenshot of a confidently wrong answer is higher than the cost of a quarter of engineering work to harden citations, refusal behavior, and tenant isolation. Spend the quarter. Then ship.

    export async function getOrCreateAgentSession(
      user: AuthenticatedUser,
      conversationId: string
    ) {
      // Tenant isolation enforced at session creation.
      // Memory scope is BOUND to user_id. Anthropic enforces this server-side,
      // but we also encode it explicitly so any drift fails loudly.
      const session = await client.agents.sessions.findOrCreate({
        agent_id: "doc-qa-prod",
        external_id: `${user.tenant_id}:${user.id}:${conversationId}`,
        user_id: user.id,
        memory_scope: {
          user: user.id,
          tenant: user.tenant_id,
          conversation: conversationId
        },
        metadata: {
          tenant_id: user.tenant_id,
          plan_tier: user.plan,
          // hard refusal for documents flagged as restricted
          document_class: await classifyDocsForUser(user.id)
        }
      });

      // Assert invariant: this session's memory is NEVER accessible to
      // another tenant. If we ever ship a code path that bypasses this,
      // the entire product is broken.
      if (session.memory_scope.tenant !== user.tenant_id) {
        throw new TenantBoundaryViolation(session.id);
      }

      return session;
    }

What your customer sees: a chat that knows their documents, answers with footnoted citations, can render a chart inline, and refuses politely when asked something outside scope. What you see in your traces: ~$0.08/conversation, p95 first-token latency under 1.2s, citation accuracy above 96% on the eval set.
A multi-tool agent that does the work of a $70k junior ops hire across calendar, email, ticketing, and finance. Lives in Slack. Owners DM it with operational requests ("schedule the team offsite for week of June 8 with these constraints"; "reconcile last week's expenses against the budget"; "draft the proposal for [client] using our template"). It plans the work, executes through MCP tools, surfaces approval gates for anything write-shaped, and learns the business's patterns over time. Brand risk: high — this agent has more tool surface than the other two combined.
Org-scoped (the business this agent serves). User-scoped per requester. Session-scoped per task. Long-lived org memory holds the business's preferences, standards, and the "what I'm avoiding" list from the small-business playbooks.
MCP servers: Google Calendar (read+write), Gmail (read, draft-only), HubSpot (read+write), QuickBooks (read+write, gated), Drive (read+write). Functions: ~25 custom tools across scheduling, comms, finance, and project management.
Tiered approval gates: read operations free; calendar invites auto-send with audit log; emails draft-only; financial writes require owner approval. Hard refusal on: HR-shaped requests, legal-shaped requests, anything involving termination. Daily summary of every action taken posts to a private owner channel.
A Slack workspace bot. Async by design — the owner DMs and goes back to work; the agent posts back when done, sometimes minutes later. Long-running tasks (a multi-step scheduling negotiation across 6 people) can stretch over hours with periodic status updates.
This is the build that consultants charge for. The architecture is straightforward; the discipline isn't. The fractional ops agent has to interact with five external systems, hold organizational context across days, and maintain the right approval gates so an owner trusts it. Most teams ship a v1 that's exciting in demo and broken in week 3. The break is always the same place: insufficient guardrails on write tools, plus drift in the eval set as scope creeps. Build this last, with someone who's done it before, after the other two are in production.

    // Approval routing for write-capable tools.
    // The agent proposes; the owner approves; the system executes.
    export async function handleApprovalRequest(event: AgentEvent) {
      const action = event.proposed_action;

      // Cheap reads: auto-approve and log
      if (action.tool.access === "read") {
        await auditLog({ event, decision: "auto_approved" });
        return client.agents.approve(event.id);
      }

      // Calendar writes: auto-approve, but audit + reversible
      if (action.tool.name === "calendar.create_event") {
        await auditLog({ event, decision: "auto_approved" });
        return client.agents.approve(event.id);
      }

      // Outbound comms: DRAFT ONLY, ever. No exceptions.
      if (action.tool.name === "email.send") {
        return client.agents.reject(event.id, {
          reason: "Outbound email must be sent by a human.",
          alternative: "email.create_draft"
        });
      }

      // Financial writes: owner approval required, real-time
      if (action.tool.category === "finance" && action.tool.access === "write") {
        const decision = await requestOwnerApprovalViaSlack({
          action,
          context: event.context,
          timeout_seconds: 300
        });
        return decision.approved
          ? client.agents.approve(event.id)
          : client.agents.reject(event.id, { reason: decision.reason });
      }

      // Anything not explicitly routed: HARD REJECT.
      // Better to fail loudly than to ship an over-permissioned agent.
      return client.agents.reject(event.id, {
        reason: "No approval policy defined for this tool. Escalating to engineering."
      });
    }

What the owner experiences in 90 days: they stop doing about 8 hours/week of calendar-and-email coordination. The agent has learned that Mondays are protected, that two specific clients always get same-day responses, and that the bookkeeper handles AP — never the agent. The trust took 90 days to earn. It earns back ~$60k/year of fractional-ops labor avoided, indefinitely.
Most agent demos work. Most agent products don't, because the production surface area is much larger than the demo surface area. Five things that bite teams in the first quarter of running a Managed Agent in production, named so you can design for them on day one.
Managed Agents pricing typically includes inference (per-token, per-model), tool execution (per-call for native tools, free for your custom functions), code execution (per-second of runtime), file storage (per-GB-month), and the platform fee itself. The cost-per-session for a real production agent in 2026 ranges from $0.02 (a quick triage call) to $5 (a long-running ops agent doing real work).
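The pricing components add up linearly, which makes a back-of-envelope estimator easy to write. In this sketch every rate is a placeholder you supply yourself, not a published price, and treating the platform fee as a flat per-session amount is an assumption made only to keep the arithmetic simple.

```typescript
// Hypothetical per-session cost model. Rates are inputs, not real prices.
interface UsageRates {
  inputPerMillionTokensUsd: number;
  outputPerMillionTokensUsd: number;
  nativeToolCallUsd: number; // custom functions are assumed free
  codeExecPerSecondUsd: number;
  storagePerGbMonthUsd: number;
  platformFeePerSessionUsd: number; // assumption: flat fee per session
}

interface SessionUsage {
  inputTokens: number;
  outputTokens: number;
  nativeToolCalls: number;
  codeExecSeconds: number;
  storedGb: number;
  storageMonths: number;
}

function estimateSessionCostUsd(usage: SessionUsage, rates: UsageRates): number {
  const inference =
    (usage.inputTokens / 1e6) * rates.inputPerMillionTokensUsd +
    (usage.outputTokens / 1e6) * rates.outputPerMillionTokensUsd;
  const tools = usage.nativeToolCalls * rates.nativeToolCallUsd;
  const codeExec = usage.codeExecSeconds * rates.codeExecPerSecondUsd;
  const storage = usage.storedGb * usage.storageMonths * rates.storagePerGbMonthUsd;
  return inference + tools + codeExec + storage + rates.platformFeePerSessionUsd;
}
```

Plugging in your own rates for a typical and a worst-case session shows where a given build sits within the range quoted above.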
The blow-ups happen in three places. First: agents that hit memory bloat — context window costs grow with conversation length, and a long-running session with poor memory hygiene spends 10× what it should. Second: agents that get stuck in tool-call loops (5 retries on a failing MCP endpoint, each consuming tokens). Third: customer-facing agents being abused by users who realized they have a free LLM API.
The mitigation pattern: cost_alert_per_session_usd on every agent, hard caps per user per day, monitoring on the p99 of session cost (not the median — the long tail is where the money goes).
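To see why the p99 matters more than the median, here is a small sketch of both pieces: a nearest-rank percentile over session costs and a cap check. The function names, the `SpendCaps` shape, and the decision labels are hypothetical; the platform's own alerting would sit alongside this, not be replaced by it.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no values");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical hard caps, checked before each further model or tool call.
interface SpendCaps {
  perSessionUsd: number;
  perUserPerDayUsd: number;
}

type CapDecision = "continue" | "stop_session" | "block_user_today";

function capDecision(sessionSpendUsd: number, userSpendTodayUsd: number, caps: SpendCaps): CapDecision {
  if (userSpendTodayUsd >= caps.perUserPerDayUsd) return "block_user_today";
  if (sessionSpendUsd >= caps.perSessionUsd) return "stop_session";
  return "continue";
}
```

With ten sessions where nine cost a few cents and one runaway session costs $4.80, the median looks healthy and only the p99 surfaces the problem.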
First-token latency is what determines whether your chat feels responsive. Managed Agent first-token p95 in 2026 is roughly 800ms–1.5s for simple agents (one or two tool calls), 2–5s for complex agents that need to plan and route through MCP. Tool calls add their own latency — a slow MCP server can dominate the wait. Code execution can run 5–30s depending on the work.
The user-experience pattern that wins: stream everything. Stream the agent's "thinking" message ("Looking up your AR aging…"), stream the tool-call events ("Fetched 47 open invoices, ranking…"), stream the final response token-by-token. Users tolerate a 12-second response if they see motion for all 12 seconds; they bounce at 4 seconds of dead air.
An agent's memory grows. Every conversation adds to user-memory. Every tool result the agent decides is "important" gets persisted. Three months in, your agent's memory state is bloated, retrieval is slower, and Claude is spending tokens on context that's no longer useful.
The discipline: memory expiration by default. Session memory expires when the session ends. User memory expires items by recency-of-relevance, not recency-of-write. Org memory is the only place long-lived facts live, and it should be explicitly written to, not implicitly accumulated.
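Expiring by recency-of-relevance rather than recency-of-write comes down to which timestamp the pruning job reads. A minimal sketch, assuming a hypothetical `MemoryItem` record in which `lastRelevantAt` is updated whenever an item is actually retrieved and used:

```typescript
// Hypothetical user-memory record; timestamps are epoch milliseconds.
interface MemoryItem {
  key: string;
  writtenAt: number;
  lastRelevantAt: number; // bumped whenever the item is retrieved and used
}

// Keep items that were relevant within the idle window, however old they are.
function pruneUserMemory(items: MemoryItem[], nowMs: number, maxIdleDays: number): MemoryItem[] {
  const cutoff = nowMs - maxIdleDays * 24 * 60 * 60 * 1000;
  return items.filter((item) => item.lastRelevantAt >= cutoff);
}
```

An item written 200 days ago but used two days ago survives; an item written 40 days ago and never touched since does not.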
Your agent's system prompt is the most important config in your product. It needs versioning, code review, eval gating, rollback capability. Most teams launch an agent and then iterate the system prompt in the platform UI like it's a Google Doc. Six months later they can't explain why responses changed and they can't revert.
The pattern that scales: system prompts live in your repo as code. Deploys happen through CI/CD. Every change runs the eval set before promote. The platform UI is for exploration; production uses your normal deployment process.
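In CI, "every change runs the eval set before promote" becomes a gate function the deploy step calls. The sketch below is one possible gate, not a prescribed one: the `EvalSummary` shape and the specific rules (no critical failures, no pass-rate regression, no shrinking eval set) are assumptions you would tune to your own risk tolerance.

```typescript
// Hypothetical promotion gate for a system-prompt change.
interface EvalSummary {
  promptVersion: string;
  passed: number;
  total: number;
  criticalFailures: number; // failures on evals marked must-never-fail
}

interface GateResult {
  promote: boolean;
  reason: string;
}

function promotionGate(candidate: EvalSummary, production: EvalSummary): GateResult {
  if (candidate.total === 0) return { promote: false, reason: "eval set did not run" };
  if (candidate.total < production.total) return { promote: false, reason: "eval set shrank" };
  if (candidate.criticalFailures > 0) return { promote: false, reason: "critical eval failed" };
  const candidateRate = candidate.passed / candidate.total;
  const productionRate = production.total === 0 ? 0 : production.passed / production.total;
  if (candidateRate < productionRate) return { promote: false, reason: "pass rate regressed" };
  return { promote: true, reason: "ok" };
}
```

Because the gate returns a reason, a failed deploy explains itself in the CI log, which is exactly the record that is missing when prompts are edited in a UI.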
The day after you ship a customer-facing agent, somebody will try to prompt-inject it. The week after, somebody will try to use it as a free general-purpose LLM. The month after, somebody will try to get it to say something that screenshots well. These are not edge cases. They are the standard population of users you will have.
Plan for it: input filters on every customer-facing agent (prompt injection detection, abuse pattern matching, rate limiting per user). Refusal behavior that's been written down and evaluated. A way to silently flag suspicious conversations for human review without breaking the experience for legitimate users. A way to retract a response that went out wrong (you will need this).
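Per-user rate limiting is the simplest of these defenses to put in front of the agent yourself. Here is a minimal in-memory sliding-window limiter as a sketch; the class name and limits are illustrative, and a production version would keep its counters in shared storage so that every instance of your service sees the same counts.

```typescript
// Minimal in-memory sliding-window rate limiter, keyed by user ID.
class SlidingWindowLimiter {
  private hits: { [userId: string]: number[] } = {};
  private maxRequests: number;
  private windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the user is under the limit.
  allow(userId: string, nowMs: number): boolean {
    const recent = (this.hits[userId] || []).filter((t) => nowMs - t < this.windowMs);
    const allowed = recent.length < this.maxRequests;
    if (allowed) recent.push(nowMs);
    this.hits[userId] = recent;
    return allowed;
  }
}
```

Call `allow(userId, Date.now())` before creating or invoking a session, and answer a denied request with a polite "slow down" message rather than an error, so legitimate users are not punished for a burst.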
An agent without an eval set is a script that happens to work. An agent with an eval set is a product. The eval set is the most important artifact in your build, and most teams ship without one because writing it feels like overhead. It isn't. It's the spec.
Your eval set is a collection of inputs (user messages, conversation histories, document uploads, tool results) paired with assertions about what the agent should do. Some assertions are deterministic ("should call tool X"); some are graded by a judge model ("response should cite invoice number"); some are graded by humans on a sample.
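The deterministic part of an eval set can be a plain data structure plus a grader. The sketch below covers only that part: the `EvalCase` and `RecordedTurn` shapes are hypothetical, the regex check is a cheap stand-in for assertions that might otherwise go to a judge model, and judge-model and human grading are left out entirely.

```typescript
// Hypothetical shapes for deterministic eval cases.
interface RecordedTurn {
  toolsCalled: string[];
  response: string;
}

interface EvalCase {
  name: string;
  userMessage: string;
  mustCallTool?: string; // deterministic: this tool must appear in the turn
  responsePattern?: RegExp; // deterministic: the response must match
}

// Returns a list of failure messages; an empty list means the case passed.
function gradeCase(evalCase: EvalCase, turn: RecordedTurn): string[] {
  const failures: string[] = [];
  if (evalCase.mustCallTool !== undefined && turn.toolsCalled.indexOf(evalCase.mustCallTool) === -1) {
    failures.push(evalCase.name + ": expected tool " + evalCase.mustCallTool);
  }
  if (evalCase.responsePattern !== undefined && !evalCase.responsePattern.test(turn.response)) {
    failures.push(evalCase.name + ": response did not match " + String(evalCase.responsePattern));
  }
  return failures;
}
```

Keeping cases as data means a production failure can be turned into a new eval by appending one object to a list.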
Every prompt change runs the full eval set. Every new failure mode found in production becomes a permanent eval. The eval set grows monotonically; it never shrinks. After six months it's a comprehensive description of what your agent is supposed to do, and onboarding a new engineer to the agent means showing them the evals before the prompts.
The teams that have working production agents in 2026 are the ones that built evals first and prompts second. The teams that have AI features that "work in demo but break in production" did it the other way around.
The platform handles most of the easy mistakes for you. The remaining mistakes are higher-order — architecture, scope, discipline — and they're the ones that turn a six-week build into a six-month rebuild.
You have a single-shot classification or extraction job running at scale. Wrapping it in an agent runtime adds latency, cost, and operational complexity for no benefit.
The decision matrix at the top of this page exists to catch this. If your build is one model call, it's an API call — not an agent.
Six weeks in, "the prompt feels good." Three months in, a prompt tweak breaks something nobody noticed because there was no spec.
Evals first. Prompts second. The eval set is the agent's spec; treat it like the test suite of your most important code.
Every tool the agent has access to is a permission boundary. Giving the agent write access to your CRM, your Calendar, your finance system, and your email — without explicit approval gates per category — is how the embarrassing screenshots happen.
Default deny. Approve writes per-category. Outbound communication stays human-sent. The approval routing pattern in the Tier-3 build is the production-grade version of this.
Putting org-level facts in session memory: agent forgets your business between sessions. Putting session state in user memory: agent confuses one conversation with another. Putting user data in org memory: tenant boundary violation.
Memory scope decisions are not implementation details. They are product decisions. Make them on day one and write them down.
An agent in a tool-call retry loop, or a customer abusing your free agent as a general-purpose LLM, can run up four-figure bills in a day. Most teams discover this from a billing alert, not from a cost cap.
Per-session caps, per-user-per-day caps, p99 monitoring. Set them before launch. The platform supports them; teams skip configuring them.
Three years from now, the pricing changes, or the platform deprecates a feature you depend on, or a competitor offers a 5x better product. If your entire agent architecture is tightly coupled to Managed Agents, your switching cost is months of work.
Keep the system prompt and eval set portable. Keep your tool definitions in your repo, not in the platform UI. Keep the integration plane (how your product calls the agent) abstracted behind a thin interface. The day you need to switch, the agent runtime is the only piece that should be Anthropic-specific.
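A thin interface for the integration plane can be very small. The sketch below is illustrative: `AgentRuntime`, `AgentReply`, and `EchoRuntime` are hypothetical names, and a real adapter would wrap whatever session and invoke calls the platform actually exposes.

```typescript
// Hypothetical thin interface; none of these names come from a real SDK.
interface AgentReply {
  text: string;
}

interface AgentRuntime {
  send(conversationId: string, message: string): Promise<AgentReply>;
}

// Stand-in runtime for tests and local development.
class EchoRuntime implements AgentRuntime {
  send(conversationId: string, message: string): Promise<AgentReply> {
    return Promise.resolve({ text: "[" + conversationId + "] " + message });
  }
}

// Product code depends only on the interface, never on a vendor SDK.
function askAgent(runtime: AgentRuntime, conversationId: string, message: string): Promise<string> {
  return runtime.send(conversationId, message).then((reply) => reply.text);
}
```

The vendor-specific adapter is then one class implementing `AgentRuntime`, and it is the only file that changes on the day you switch.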
If you're evaluating whether Managed Agents is the right call for a specific build on your roadmap — that's the conversation worth having before the architecture gets locked in. Free 30 minutes, direct technical read, no pitch.