The Claude Platform series · 01 — Managed Agents

Managed Agents on the Claude Platform. The architecture decision before you write a line of code.

Anthropic's Managed Agents on platform.claude.com is the production runtime for agents: memory, tool execution, code interpreter, file handling, MCP wiring, and observability, all operated for you. The pitch: trade some architectural control for a roughly 80% cut in time-to-production. This page is for the CTO or VP Eng deciding whether that trade is right for your team, your stack, and your roadmap. It covers the decision matrix, the seven-part architecture, three real reference builds, and the production concerns most teams discover three months in.

01 The platform decision
02 The seven-part architecture
03 Three reference builds
04 Production concerns
05 Evals & the iteration loop
06 How Managed Agent builds fail

A note on code in this page. The code snippets show the architectural shape of a Managed Agents build — names, structure, calling pattern — not necessarily the exact SDK syntax of any given week. The platform is evolving rapidly through 2026; verify specific API names, parameter shapes, and event types against the current platform.claude.com/docs at build time. The architectural decisions in this page are stable; the syntax may not be.

01 — The platform decision

Managed Agents vs. the API vs. Claude Code vs. roll-your-own. Pick correctly the first time.

The most expensive engineering mistake in 2026 isn't picking the wrong agent framework — it's picking the wrong layer of abstraction for what you're trying to build. Four real options. They serve different problems. Here's the honest comparison.

Anthropic now offers four meaningfully different products that solve overlapping but distinct problems. The naming makes them sound like a hierarchy; they're actually a portfolio. Choose by use case, not by ambition.

Layer	What it is	Best for	Watch out for
Claude (chat), claude.ai	The end-user product: humans typing into a chat interface, with Projects, Skills, integrations.	Internal team use, Claude Projects for SMB workflows, the small-business cookbook pattern. [non-engineers]	Not a developer surface. You can't programmatically invoke it; you can't ship it inside your product.
Claude Code, CLI / IDE	Anthropic's coding agent in the terminal. Built for engineers working on a codebase, locally, with hooks, slash commands, MCP, sub-agents.	Internal developer productivity. Building & maintaining your team's codebase. Internal automations engineers run themselves. [eng leverage]	Wrong tool for production agents your users interact with. Not deployable as a service.
Claude API, api.anthropic.com	Raw model inference. You bring everything else: memory, tools, persistence, evals, observability, retries, rate limiting.	Single-call use cases (classification, extraction, summarization). Bespoke architectures where you need full control. [infra-heavy]	You're rebuilding the agent runtime from scratch. Time-to-production: months, not weeks. Maintenance forever.
Managed Agents, platform.claude.com	Anthropic-operated agent runtime. You define the system prompt, tools, memory model, MCP servers, guardrails. They run it, persist it, scale it, observe it.	Production agents your users interact with. In-product chat, support, recommendation, document Q&A. Internal agents replacing tier-1 ops work. [production-ready]	Less architectural control than rolling your own. Pricing model can surprise you at scale. Vendor concentration risk if Anthropic changes the product.

The decision in one paragraph

If you're shipping an agent your customers or employees interact with and you don't have a strategic reason to own the agent runtime — Managed Agents. If you're doing single-shot inference at scale (classification, extraction, content generation) — API directly. If you're making your engineering team faster on your codebase — Claude Code. If you're enabling non-engineers in your business to use AI on documents and workflows — Claude (chat) with Projects. Most companies end up using three of the four. The question for any given build is which one this build belongs to.
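The paragraph above reduces to a small routing function. This is a minimal sketch, assuming a hypothetical `BuildProfile` shape invented here for illustration; none of these names come from Anthropic's documentation.

```typescript
// Hypothetical build description; the field names are illustrative only.
type Layer = "managed-agents" | "api" | "claude-code" | "claude-chat";

interface BuildProfile {
  audience: "customers-or-employees" | "engineers-on-codebase" | "non-engineers";
  singleShotInference: boolean; // classification, extraction, content generation
  strategicReasonToOwnRuntime: boolean;
}

function chooseLayer(build: BuildProfile): Layer {
  // A single model call per request is an API job, whoever the audience is.
  if (build.singleShotInference) return "api";
  if (build.audience === "engineers-on-codebase") return "claude-code";
  if (build.audience === "non-engineers") return "claude-chat";
  // Interactive agent: own the runtime only if there is a strategic reason to.
  return build.strategicReasonToOwnRuntime ? "api" : "managed-agents";
}

const inProductSupport = chooseLayer({
  audience: "customers-or-employees",
  singleShotInference: false,
  strategicReasonToOwnRuntime: false,
});
```

The order of the checks is the design choice: the single-shot test comes first because it overrides audience, which is the mistake Failure 01 describes.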

When Managed Agents is the wrong answer

Five honest reasons not to choose Managed Agents even when it looks like the right fit:

Hard latency budget under 800ms p99. Managed runtimes add overhead. If you need sub-second responses, raw API with your own optimization is more controllable.
Heavy bespoke tool execution. If your agent needs to run code in your infrastructure with access to internal systems that can't be exposed via MCP, you'll fight the platform.
Multi-LLM strategy. If your architecture deliberately routes between Claude, GPT, Gemini, and a local model based on cost/capability, Managed Agents pulls you toward Claude-only.
Strict data residency. Verify Anthropic's current regional offerings against your compliance requirements. If your data can't leave region X and Anthropic doesn't operate there yet, you're done.
Strong existing investment in LangGraph / LlamaIndex / AutoGen. If you have a working production stack and a team that knows it, the switching cost may exceed the benefit.
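The five reasons above work well as a pre-flight checklist you run before committing to the platform. A rough sketch, assuming a hypothetical `Constraints` record that you fill in yourself (the list of offered regions in particular must come from the vendor's current documentation, not from this page):

```typescript
// Hypothetical inputs; every field is something you supply, not an API value.
interface Constraints {
  p99LatencyBudgetMs: number;
  needsToolsUnreachableViaMcp: boolean;
  routesAcrossModelVendors: boolean;
  requiredRegion: string;
  regionsOffered: string[]; // copy from the vendor's current regional offerings
  hasWorkingAgentStack: boolean;
}

function redFlags(c: Constraints): string[] {
  const flags: string[] = [];
  if (c.p99LatencyBudgetMs < 800) flags.push("hard latency budget under 800ms p99");
  if (c.needsToolsUnreachableViaMcp) flags.push("bespoke tools that cannot be exposed via MCP");
  if (c.routesAcrossModelVendors) flags.push("multi-LLM routing strategy");
  if (c.regionsOffered.indexOf(c.requiredRegion) === -1) flags.push("required region not offered");
  if (c.hasWorkingAgentStack) flags.push("existing production stack; switching cost may exceed benefit");
  return flags;
}
```

An empty result does not prove the platform fits; it only means none of the five known disqualifiers applies.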

The honest read. Managed Agents is correct for the majority of small-to-mid-sized companies shipping agentic features in 2026. Not because the alternatives are bad, but because most engineering teams underestimate by 5–10x what it costs to operate a production agent: memory, persistence, retries, observability, evals, prompt versioning, cost monitoring, abuse detection, audit logs. Anthropic is operating that stack at scale. Reaching for it earlier than you think you should is usually the right call.

02 — The architecture

The seven parts every Managed Agent build touches.

A Managed Agent is not a model call. It's a system with seven components — each of which has architectural decisions that determine whether your build works in production or only in demos. Get these seven right and the agent will hold up. Get any of them wrong and you'll be debugging the wrong part of the stack for a quarter.

┌─────────────────────────────────────────────────────────────────────┐
│                        MANAGED AGENT RUNTIME                        │
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │ System Prompt│    │    Memory    │    │     Tools & MCP      │   │
│  │ + Guardrails │◄──►│ session/user │◄──►│  native + custom +   │   │
│  │              │    │     /org     │    │     MCP servers      │   │
│  └──────────────┘    └──────────────┘    └──────────────────────┘   │
│         ▲                   ▲                       ▲               │
│         │                   │                       │               │
│         └───────────┐       │       ┌───────────────┘               │
│                     ▼       ▼       ▼                               │
│                ┌──────────────────────────┐                         │
│                │   Claude (Opus/Sonnet)   │                         │
│                └──────────────────────────┘                         │
│                             │                                       │
│                             ▼                                       │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │  Code Exec   │    │   File I/O   │    │    Observability     │   │
│  │ (sandboxed)  │    │  (uploads,   │    │   tracing, evals,    │   │
│  │              │    │  artifacts)  │    │    cost, latency     │   │
│  └──────────────┘    └──────────────┘    └──────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
          │                   │                       │
          ▼                   ▼                       ▼
    Your Backend        Your Database          Your Dashboards

System prompt & persona

The agent's job, voice, tone, constraints, and refusal behavior. Iterate this with evals (component 07) — not vibes.

This is where most agent quality lives. A great system prompt with mediocre tools beats a mediocre prompt with great tools, every time.

Key decision: Static vs. dynamically composed. Static is easier to evaluate; dynamic gives you per-user customization at the cost of testability.

Memory model

Three scopes that matter: session (this conversation), user (this human across sessions), organization (everything the agent learned about this customer or tenant).

The wrong scope is the most common architecture mistake. Putting org-level facts in session memory means the agent forgets your business. Putting session-level state in org memory means it leaks across conversations.

Key decision: What goes in each scope, and when does memory expire. Default to narrower scopes than you think.
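One way to make the scope decision explicit is to route every fact through a single function and attach an expiry by scope. The sketch below is illustrative only: the `Fact` flags, the 24-hour session TTL, and the 90-day user TTL are assumptions chosen for the example, not platform defaults.

```typescript
type Scope = "session" | "user" | "org";

// Hypothetical classification of a fact the agent wants to remember.
interface Fact {
  text: string;
  aboutOnePerson: boolean;   // holds for this human across sessions
  aboutWholeTenant: boolean; // holds for the customer or tenant as a whole
}

// Pick the narrowest scope that fits; anything unclassified stays in session.
function scopeFor(fact: Fact): Scope {
  if (fact.aboutOnePerson) return "user";
  if (fact.aboutWholeTenant) return "org";
  return "session";
}

const HOUR_MS = 60 * 60 * 1000;

// Assumed TTLs for illustration; null means "no automatic expiry".
const TTL_MS: Record<Scope, number | null> = {
  session: 24 * HOUR_MS,
  user: 90 * 24 * HOUR_MS,
  org: null,
};

function expiresAt(scope: Scope, writtenAtMs: number): number | null {
  const ttl = TTL_MS[scope];
  return ttl === null ? null : writtenAtMs + ttl;
}
```

Defaulting the fall-through case to session is the code form of "narrower than you think": a fact has to earn its way into a wider scope.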

Tools & MCP

Three layers: native platform tools (web search, code, files), custom function-calling tools you define inline, and MCP servers you connect (your DB, your API, your CRM).

The tools an agent has access to are the surface area of what it can do. They're also the surface area of what it can do wrong. Every tool is a permission boundary that needs to be designed deliberately.

Key decision: Read-only vs. write-capable tools. Write tools need approval gates; treat them as you would database migrations.

Code execution

Sandboxed Python (and increasingly JS) runtime the agent can use to compute, transform data, render charts, or run validation logic. Underrated capability.

The "compute" leg of the stool. A document-Q&A agent without code exec can answer questions; one with code exec can also do math on the document, generate the chart, and validate its own answer.

Key decision: What runtime libraries to allow, and resource limits. Default tight; loosen by use case.

File handling

How files enter the agent (user uploads, integration pulls, generated artifacts) and how they leave (downloads, side effects to your system, returned in responses).

The most under-thought-about part of agent builds. File flow is where data leakage, large-context cost blowups, and unexpected processing failures happen.

Key decision: Where files are persisted and for how long. Files in agent memory cost money and create compliance surface.

Guardrails

Input filtering (block PII / prompt injection / abuse), output filtering (block hallucinations on critical facts, redact sensitive data), and refusal behavior (when the agent declines and what it says).

Not a layer you bolt on at the end. Guardrails are part of the agent's identity. Decide them while writing the system prompt.

Key decision: Where you intercept (pre-prompt vs. mid-tool vs. post-response). Each has different costs and latency.
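The three interception points can be pictured as three small functions, one per stage. This is a toy sketch under stated assumptions: the function names are hypothetical, the injection pattern is a single regular expression standing in for a real classifier, and the redaction covers only US SSN-shaped strings.

```typescript
type Verdict = { allowed: true } | { allowed: false; reason: string };

// Pre-prompt: runs before the model sees the message.
function checkInput(userMessage: string): Verdict {
  const suspicious = /ignore (all )?(previous|prior) instructions/i;
  return suspicious.test(userMessage)
    ? { allowed: false, reason: "possible prompt injection" }
    : { allowed: true };
}

// Mid-tool: runs when the agent proposes a tool call.
function checkToolCall(access: "read" | "write", approved: boolean): Verdict {
  if (access === "write" && !approved) {
    return { allowed: false, reason: "write tool requires approval" };
  }
  return { allowed: true };
}

// Post-response: runs on the draft before the user sees it.
function redactOutput(draft: string): string {
  return draft.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[REDACTED]");
}
```

The cost difference is visible in the shapes: the pre-prompt check saves a model call when it blocks, the mid-tool check pauses a turn already in flight, and the post-response check has already paid for the full response.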

Observability & evals

Tracing every turn (system prompt + memory state + tools invoked + tokens + cost + latency), running evals on every prompt change, alerting on drift.

The discipline that separates demos from production. If you can't replay a bad conversation and explain what happened, you don't have a production agent.

Key decision: Your eval set is the spec of the agent. Treat it like the test suite of the most important code in your product.

The integration plane (yours)

Everything beyond the runtime: how your product calls the agent, how it receives streamed events, how it surfaces tool approvals to users, how it stores conversation references in your DB.

Managed Agents operates the agent runtime; you still own the integration surface in your product. Plan for it.

Key decision: Sync vs. async invocation model. Async-with-streaming is correct for most chat surfaces; sync is correct for short triage agents.

The shape of an agent definition

Conceptually, defining a Managed Agent looks roughly like this. Names, parameter shapes, and event types will vary — verify against current docs — but the components are stable:

agent_definition.ts — the seven parts in code

TypeScript · Illustrative

import { Anthropic } from "@anthropic-ai/sdk";

const client = new Anthropic();

const agent = await client.agents.create({
  name: "collections-agent-v3",

  // 01: System prompt & persona
  system: `You are an AR collections specialist for Acme Corp.
  Your job: chase past-due invoices firmly but warmly.
  Voice: direct, no corporate hedging, never threaten.
  Always cite the specific invoice number and amount.
  Escalate to a human for accounts > 90 days or > $25k.`,

  // 02: Memory scopes
  memory: {
    session: { enabled: true, ttl_hours: 24 },
    user:    { enabled: true, scope: "customer_id" },
    org:     { enabled: true, scope: "acme_corp" }
  },

  // 03: Tools & MCP
  tools: [
    { type: "native", name: "web_search" },
    { type: "function", definition: getInvoiceByIdSchema },
    { type: "function", definition: draftChaseEmailSchema },
    { type: "function", definition: markEscalatedSchema,
      requires_approval: true }  // write tool: gated
  ],
  mcp_servers: [
    { url: "mcp://quickbooks.acme.internal", scopes: ["read:ar"] }
  ],

  // 04: Code execution
  code_execution: {
    enabled: true,
    runtime: "python-3.12",
    libraries: ["pandas", "numpy"],
    max_seconds: 30
  },

  // 05: File handling
  files: {
    accept_uploads: ["application/pdf", "text/csv"],
    max_file_size_mb: 25,
    retention_days: 30
  },

  // 06: Guardrails
  guardrails: {
    input:  { block_pii: true, prompt_injection_filter: "strict" },
    output: { block_hallucinated_amounts: true, redact_ssn: true },
    refusal: { tone: "professional", escalation_contact: "ar-lead@acme.co" }
  },

  // 07: Observability
  observability: {
    trace: "verbose",
    cost_alert_per_session_usd: 0.50,
    latency_alert_p99_ms: 8000,
    eval_set: "collections_eval_v3"
  }
});

Read this code as architecture, not as syntax. The specific field names, nesting, and SDK methods will differ from what platform.claude.com publishes today. The seven concepts — system prompt, memory scopes, tool layers, code execution, file handling, guardrails, observability — are what every Managed Agent build configures, regardless of how the API surfaces them this quarter.

03 — Three reference builds

One internal agent, one customer-facing agent, one frontier agent.

Three builds, ordered from lowest to highest brand risk. Each shows the architectural decisions for a real production agent — not toy examples. Internal first (because tier-1 ops replacement is the highest-ROI starter project), customer-facing second (the bulk of the long-term value), frontier third (the build that pays for the consultant who helps you avoid the mistakes).

Tier 1 · Internal

The AR Collections Agent

Reads QuickBooks AR aging daily, drafts chase emails ranked by likelihood-to-pay × dollar value × days late, posts the queue to your bookkeeper for one-click approve-and-send. Handles soft-reply triage. Escalates the hard accounts to a human with full context. Replaces ~30% of a half-time bookkeeper's week. Brand risk: low — emails go through human approval before sending.
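The ranking rule named above (likelihood-to-pay × dollar value × days late) is simple enough to write down directly. A minimal sketch, assuming a hypothetical `OpenInvoice` shape and a `likelihoodToPay` estimate between 0 and 1 that comes from somewhere else (payment history, a model, or a bookkeeper's judgment):

```typescript
// Hypothetical invoice record; field names are illustrative.
interface OpenInvoice {
  invoiceNumber: string;
  amountUsd: number;
  daysLate: number;
  likelihoodToPay: number; // 0 to 1, estimated elsewhere
}

function chaseScore(inv: OpenInvoice): number {
  return inv.likelihoodToPay * inv.amountUsd * inv.daysLate;
}

// Highest score first; the input array is left untouched.
function rankChaseCandidates(invoices: OpenInvoice[], topN: number = 10): OpenInvoice[] {
  return [...invoices].sort((a, b) => chaseScore(b) - chaseScore(a)).slice(0, topN);
}

const queue = rankChaseCandidates([
  { invoiceNumber: "INV-A", amountUsd: 1000, daysLate: 30, likelihoodToPay: 0.9 },  // 27,000
  { invoiceNumber: "INV-B", amountUsd: 5000, daysLate: 45, likelihoodToPay: 0.5 },  // 112,500
  { invoiceNumber: "INV-C", amountUsd: 20000, daysLate: 10, likelihoodToPay: 0.2 }, // 40,000
]);
// queue order: INV-B, INV-C, INV-A
```

Because the score is a plain product, a large invoice with a low chance of payment can still outrank a small, likely one; whether that is the behavior you want is a question for the eval set.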

Memory scope

User-scoped to the bookkeeper running it; org-scoped on customer payment history (which arguments worked, what tone resonates with which AR contact, what's been promised).

Tools

MCP server: QuickBooks (read). Functions: draft_chase_email, get_customer_history, mark_for_human_review (gated). Native: code exec for AR-aging math.

Guardrails

Never threaten. Never quote a dollar amount the agent can't trace to a real invoice. Always cite the invoice number. Escalate accounts more than 90 days past due or above $25k. Block any send where the agent's own confidence-on-draft score is below 7/10.
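Most of these guardrails are mechanical enough to enforce outside the prompt, after the agent drafts and before anything reaches the approval queue. A sketch of that check, assuming a hypothetical `ChaseDraft` shape; "never threaten" is not covered here, since tone is better handled by evals than by a rule:

```typescript
// Hypothetical draft record produced by the agent.
interface ChaseDraft {
  invoiceNumber: string | null; // invoice the email cites, if any
  quotedAmountUsd: number;
  daysPastDue: number;
  confidence: number; // agent's self-score, 0 to 10
}

type Decision = "queue_for_approval" | "escalate_to_human" | "block";

function reviewDraft(draft: ChaseDraft, invoiceAmounts: Map<string, number>): Decision {
  // Hard accounts go to a human rather than into the send queue.
  if (draft.daysPastDue > 90 || draft.quotedAmountUsd > 25000) return "escalate_to_human";
  // Every draft must cite an invoice.
  if (draft.invoiceNumber === null) return "block";
  // The quoted amount must trace to that invoice in the system of record.
  if (invoiceAmounts.get(draft.invoiceNumber) !== draft.quotedAmountUsd) return "block";
  if (draft.confidence < 7) return "block";
  return "queue_for_approval";
}

const ledger = new Map<string, number>([["INV-1042", 1200]]);
```

Running the same rules in code and in the system prompt is deliberate redundancy: the prompt shapes what the agent tries to do, and the check catches the cases where it drifts.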

Where it lives

Slack — the bookkeeper opens a DM with the agent every morning at 8am, runs /aging, gets the ranked queue, approves and sends with reactions. Invocation is async-streaming; results stream back in 10–20 seconds.

Why this is the right starter build. Low brand risk (emails reviewed), high measurable ROI (days-to-collect on AR drops 15–25% in the first quarter), narrow scope (one tool, one user, one job). The agent has fewer than a dozen tool calls and the eval set fits in a single spreadsheet. If you can't ship this in 4 weeks of engineering effort, you have a process problem, not a Managed Agents problem.

invocation_pattern.ts — how the bookkeeper's Slack command triggers the agent

TypeScript · Illustrative

// Slack slash-command handler: /aging
export async function handleAgingCommand(req: SlackRequest) {
  const session = await client.agents.sessions.create({
    agent_id: "collections-agent-v3",
    user_id: req.user_id,
    metadata: { trigger: "slash_aging", channel: req.channel_id }
  });

  const stream = await client.agents.sessions.invoke({
    session_id: session.id,
    message: "Pull today's AR aging. Rank top 10 chase candidates.",
    stream: true
  });

  for await (const event of stream) {
    switch (event.type) {
      case "tool_call_started":
        postEphemeral(req, `🔍 ${event.tool_name}…`);
        break;
      case "tool_call_completed":
        // stream progress to the user
        break;
      case "approval_required":
        postApprovalBlock(req, event.proposed_action);
        break;
      case "final_response":
        postRichResponse(req, event.content);
        break;
    }
  }
}

What the bookkeeper sees: a ranked list of 10 AR chases, each with the suggested email draft, the customer's payment-pattern history, and Approve / Edit / Skip buttons. They process the queue in 12 minutes; the previous manual version took 90.

Tier 2 · Customer-facing

The Document Q&A Agent (in-product)

Embedded in your SaaS product. Customers upload their documents (contracts, reports, datasets) and ask questions. Agent answers, cites, computes, and produces artifacts (summary, redline, chart). Memory is user-scoped so the agent remembers what they've uploaded and what they've asked. Brand risk: medium — agent voice is your product's voice; wrong answers are your product's mistakes.

Memory scope

User-scoped for uploaded documents and conversation history. Session-scoped for the in-flight workflow state (currently analyzing X, comparing against Y). No org-scoping — strict tenant isolation is the architecture.

Tools

Native: file read, code execution (pandas, matplotlib), web search (gated by user setting). Functions: extract_clauses, compare_documents, generate_redline, render_chart.

Guardrails

Never claim a fact not present in the user's documents. Always cite the page/section. Refuse questions outside the scope of the uploaded materials (no general-knowledge Q&A drift). Redact any PII the user uploads from the response surface.
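The citation rule can be checked mechanically before a response is shown: every citation has to point at a page that exists and a quote that actually appears there. A minimal sketch, assuming hypothetical `Citation` and `Corpus` shapes and exact-substring matching (a production check would likely need fuzzier matching):

```typescript
// Hypothetical shapes; page numbers are 1-based.
interface Citation {
  documentId: string;
  page: number;
  quote: string;
}

type Corpus = Map<string, string[]>; // documentId -> text of each page

function citationProblems(citations: Citation[], corpus: Corpus): string[] {
  if (citations.length === 0) return ["answer has no citations"];
  const problems: string[] = [];
  for (const c of citations) {
    const pages = corpus.get(c.documentId);
    if (pages === undefined) {
      problems.push(`unknown document ${c.documentId}`);
      continue;
    }
    const pageText = pages[c.page - 1];
    if (pageText === undefined) {
      problems.push(`${c.documentId}: no page ${c.page}`);
      continue;
    }
    if (pageText.indexOf(c.quote) === -1) {
      problems.push(`${c.documentId} p.${c.page}: quote not found`);
    }
  }
  return problems;
}

const uploaded: Corpus = new Map([["msa.pdf", ["Term: 24 months.", "Either party may terminate on 30 days notice."]]]);
```

A non-empty result is a reason to regenerate or refuse, not to ship the answer with a warning.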

Where it lives

A chat sidebar inside your SaaS app. Invocation is async-streaming over WebSocket. Tool approvals (e.g., "send this redline as a comment to my collaborator") surface as inline UI affordances, not modals.

Why this is the build that defines your product's AI reputation. Internal agents have a forgiving audience. Customer-facing agents do not. The cost of one viral screenshot of a confidently wrong answer is higher than the cost of a quarter of engineering work to harden citations, refusal behavior, and tenant isolation. Spend the quarter. Then ship.

tenant_isolation.ts — the most important code in this build

TypeScript · Illustrative

export async function getOrCreateAgentSession(
  user: AuthenticatedUser,
  conversationId: string
) {
  // Tenant isolation enforced at session creation.
  // Memory scope is BOUND to user_id. Anthropic enforces this server-side,
  // but we also encode it explicitly so any drift fails loudly.
  const session = await client.agents.sessions.findOrCreate({
    agent_id: "doc-qa-prod",
    external_id: `${user.tenant_id}:${user.id}:${conversationId}`,
    user_id: user.id,
    memory_scope: {
      user: user.id,
      tenant: user.tenant_id,
      conversation: conversationId
    },
    metadata: {
      tenant_id: user.tenant_id,
      plan_tier: user.plan,
      // hard refusal for documents flagged as restricted
      document_class: await classifyDocsForUser(user.id)
    }
  });

  // Assert invariant: this session's memory is NEVER accessible to
  // another tenant. If we ever ship a code path that bypasses this,
  // the entire product is broken.
  if (session.memory_scope.tenant !== user.tenant_id) {
    throw new TenantBoundaryViolation(session.id);
  }

  return session;
}

What your customer sees: a chat that knows their documents, answers with footnoted citations, can render a chart inline, and refuses politely when asked something outside scope. What you see in your traces: ~$0.08/conversation, p95 first-token latency under 1.2s, citation accuracy above 96% on the eval set.

Tier 3 · Frontier

The Fractional Ops Agent

A multi-tool agent that does the work of a $70k junior ops hire across calendar, email, ticketing, and finance. Lives in Slack. Owners DM it with operational requests ("schedule the team offsite for week of June 8 with these constraints"; "reconcile last week's expenses against the budget"; "draft the proposal for [client] using our template"). It plans the work, executes through MCP tools, surfaces approval gates for anything write-shaped, and learns the business's patterns over time. Brand risk: high — this agent has more tool surface than the other two combined.

Memory scope

Org-scoped (the business this agent serves). User-scoped per requester. Session-scoped per task. Long-lived org memory holds the business's preferences, standards, and the "what I'm avoiding" list from the small-business playbooks.

Tools

MCP servers: Google Calendar (read+write), Gmail (read, draft-only), HubSpot (read+write), QuickBooks (read+write, gated), Drive (read+write). Functions: ~25 custom tools across scheduling, comms, finance, and project management.

Guardrails

Tiered approval gates: read operations free; calendar invites auto-send with audit log; emails draft-only; financial writes require owner approval. Hard refusal on: HR-shaped requests, legal-shaped requests, anything involving termination. Daily summary of every action taken posts to a private owner channel.

Where it lives

A Slack workspace bot. Async by design — the owner DMs and goes back to work; the agent posts back when done, sometimes minutes later. Long-running tasks (a multi-step scheduling negotiation across 6 people) can stretch over hours with periodic status updates.

This is the build that consultants charge for. The architecture is straightforward; the discipline isn't. The fractional ops agent has to interact with five external systems, hold organizational context across days, and maintain the right approval gates so an owner trusts it. Most teams ship a v1 that's exciting in demo and broken in week 3. The break is always the same place: insufficient guardrails on write tools, plus drift in the eval set as scope creeps. Build this last, with someone who's done it before, after the other two are in production.

approval_gate.ts — the most important pattern for high-tool-surface agents

TypeScript · Illustrative

// Approval routing for write-capable tools.
// The agent proposes; the owner approves; the system executes.
export async function handleApprovalRequest(event: AgentEvent) {
  const action = event.proposed_action;

  // Cheap reads: auto-approve and log
  if (action.tool.access === "read") {
    await auditLog({ event, decision: "auto_approved" });
    return client.agents.approve(event.id);
  }

  // Calendar writes: auto-approve, but audit + reversible
  if (action.tool.name === "calendar.create_event") {
    await auditLog({ event, decision: "auto_approved" });
    return client.agents.approve(event.id);
  }

  // Outbound comms: DRAFT ONLY, ever. No exceptions.
  if (action.tool.name === "email.send") {
    return client.agents.reject(event.id, {
      reason: "Outbound email must be sent by a human.",
      alternative: "email.create_draft"
    });
  }

  // Financial writes: owner approval required, real-time
  if (action.tool.category === "finance" && action.tool.access === "write") {
    const decision = await requestOwnerApprovalViaSlack({
      action,
      context: event.context,
      timeout_seconds: 300
    });
    return decision.approved
      ? client.agents.approve(event.id)
      : client.agents.reject(event.id, { reason: decision.reason });
  }

  // Anything not explicitly routed: HARD REJECT.
  // Better to fail loudly than to ship an over-permissioned agent.
  return client.agents.reject(event.id, {
    reason: "No approval policy defined for this tool. Escalating to engineering."
  });
}

What the owner experiences in 90 days: they stop doing about 8 hours/week of calendar-and-email coordination. The agent has learned that Mondays are protected, that two specific clients always get same-day responses, and that the bookkeeper handles AP — never the agent. The trust took 90 days to earn. It earns back ~$60k/year of fractional-ops labor avoided, indefinitely.

04 — Production concerns

The things you discover three months in.

Most agent demos work. Most agent products don't, because the production surface area is much larger than the demo surface area. Five things that bite teams in the first quarter of running a Managed Agent in production, named so you can design for them on day one.

Cost — the surprises are real and predictable

Managed Agents pricing typically includes inference (per-token, per-model), tool execution (per-call for native tools, free for your custom functions), code execution (per-second of runtime), file storage (per-GB-month), and the platform fee itself. The cost-per-session for a real production agent in 2026 ranges from $0.02 (a quick triage call) to $5 (a long-running ops agent doing real work).

The blow-ups happen in three places. First: agents that hit memory bloat — context window costs grow with conversation length, and a long-running session with poor memory hygiene spends 10× what it should. Second: agents that get stuck in tool-call loops (5 retries on a failing MCP endpoint, each consuming tokens). Third: customer-facing agents being abused by users who realized they have a free LLM API.

The mitigation pattern: cost_alert_per_session_usd on every agent, hard caps per user per day, monitoring on the p99 of session cost (not the median — the long tail is where the money goes).
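The three mitigations fit in a few dozen lines if you track spend yourself rather than waiting for a billing alert. A sketch under stated assumptions: `CostGuard` and its method names are hypothetical, cost per call is something you compute from your own traces, and the percentile uses the nearest-rank method.

```typescript
// Nearest-rank percentile over recorded session costs.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no values recorded");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

type CapResult = "ok" | "session_cap" | "user_daily_cap";

// Hypothetical in-memory guard; production would back this with a shared store.
class CostGuard {
  private sessionSpend = new Map<string, number>();
  private userDaySpend = new Map<string, number>();
  private perSessionCapUsd: number;
  private perUserDailyCapUsd: number;

  constructor(perSessionCapUsd: number, perUserDailyCapUsd: number) {
    this.perSessionCapUsd = perSessionCapUsd;
    this.perUserDailyCapUsd = perUserDailyCapUsd;
  }

  // Adds the spend, then reports which cap (if any) is now exceeded.
  record(userId: string, day: string, sessionId: string, costUsd: number): CapResult {
    const userKey = `${userId}:${day}`;
    const sessionTotal = (this.sessionSpend.get(sessionId) ?? 0) + costUsd;
    const userTotal = (this.userDaySpend.get(userKey) ?? 0) + costUsd;
    this.sessionSpend.set(sessionId, sessionTotal);
    this.userDaySpend.set(userKey, userTotal);
    if (sessionTotal > this.perSessionCapUsd) return "session_cap";
    if (userTotal > this.perUserDailyCapUsd) return "user_daily_cap";
    return "ok";
  }
}

// 98 cheap sessions, one at $1, one at $5: the median hides the tail.
const sessionCosts: number[] = [];
for (let i = 0; i < 98; i++) sessionCosts.push(0.02);
sessionCosts.push(1, 5);
const medianCost = percentile(sessionCosts, 50); // 0.02
const p99Cost = percentile(sessionCosts, 99);    // 1
```

In this sample the p99 is fifty times the median, which is why the alert belongs on the tail.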

Latency — what your users actually feel

First-token latency is what determines whether your chat feels responsive. Managed Agent first-token p95 in 2026 is roughly 800ms–1.5s for simple agents (one or two tool calls), 2–5s for complex agents that need to plan and route through MCP. Tool calls add their own latency — a slow MCP server can dominate the wait. Code execution can run 5–30s depending on the work.

The user-experience pattern that wins: stream everything. Stream the agent's "thinking" message ("Looking up your AR aging…"), stream the tool-call events ("Fetched 47 open invoices, ranking…"), stream the final response token-by-token. Users tolerate a 12-second response if they see motion for all 12 seconds; they bounce at 4 seconds of dead air.

Memory hygiene — the silent cost

An agent's memory grows. Every conversation adds to user-memory. Every tool result the agent decides is "important" gets persisted. Three months in, your agent's memory state is bloated, retrieval is slower, and Claude is spending tokens on context that's no longer useful.

The discipline: memory expiration by default. Session memory expires when the session ends. User memory expires items by recency-of-relevance, not recency-of-write. Org memory is the only place long-lived facts live, and it should be explicitly written to, not implicitly accumulated.
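"Recency-of-relevance, not recency-of-write" comes down to which timestamp you prune on. A minimal sketch, assuming a hypothetical `MemoryItem` that records both when an item was written and when it last informed a response:

```typescript
// Hypothetical memory record; lastUsedAtMs starts equal to writtenAtMs.
interface MemoryItem {
  id: string;
  writtenAtMs: number;
  lastUsedAtMs: number;
}

// Keep items that were useful recently, however long ago they were written.
function pruneUserMemory(items: MemoryItem[], nowMs: number, maxIdleMs: number): MemoryItem[] {
  return items.filter((item) => nowMs - item.lastUsedAtMs <= maxIdleMs);
}

const DAY_MS = 24 * 60 * 60 * 1000;
const now = 100 * DAY_MS;
const kept = pruneUserMemory(
  [
    { id: "old-but-used", writtenAtMs: 0, lastUsedAtMs: 99 * DAY_MS },
    { id: "newer-but-idle", writtenAtMs: 50 * DAY_MS, lastUsedAtMs: 50 * DAY_MS },
  ],
  now,
  30 * DAY_MS
);
// kept contains only "old-but-used"
```

Pruning on `writtenAtMs` instead would have dropped the one item the agent still relies on and kept the one it never touched again.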

Prompt versioning — treat it like code

Your agent's system prompt is the most important config in your product. It needs versioning, code review, eval gating, rollback capability. Most teams launch an agent and then iterate the system prompt in the platform UI like it's a Google Doc. Six months later they can't explain why responses changed and they can't revert.

The pattern that scales: system prompts live in your repo as code. Deploys happen through CI/CD. Every change runs the eval set before promote. The platform UI is for exploration; production uses your normal deployment process.

Abuse & safety — the post-launch surprise

The day after you ship a customer-facing agent, somebody will try to prompt-inject it. The week after, somebody will try to use it as a free general-purpose LLM. The month after, somebody will try to get it to say something that screenshots well. These are not edge cases. They are the standard population of users you will have.

Plan for it: input filters on every customer-facing agent (prompt injection detection, abuse pattern matching, rate limiting per user). Refusal behavior that's been written down and evaluated. A way to silently flag suspicious conversations for human review without breaking the experience for legitimate users. A way to retract a response that went out wrong (you will need this).
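Per-user rate limiting is the cheapest of these defenses to put in place. The sketch below is one common approach, a sliding-window counter kept in memory; the class name is hypothetical, timestamps are passed in to keep it testable, and a real deployment would keep the counters in a shared store.

```typescript
// Allows at most maxRequests per user within any windowMs span.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  private maxRequests: number;
  private windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  allow(userId: string, nowMs: number): boolean {
    // Drop timestamps that have aged out of the window.
    const recent = (this.hits.get(userId) ?? []).filter((t) => nowMs - t < this.windowMs);
    const allowed = recent.length < this.maxRequests;
    if (allowed) recent.push(nowMs);
    this.hits.set(userId, recent);
    return allowed;
  }
}

const limiter = new SlidingWindowLimiter(2, 1000);
```

Rejected requests are not recorded, so a user who keeps hammering the endpoint is not locked out for longer than the window itself.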

05 — Evals & the iteration loop

The discipline that separates demos from production.

An agent without an eval set is a script that happens to work. An agent with an eval set is a product. The eval set is the most important artifact in your build, and most teams ship without one because writing it feels like overhead. It isn't. It's the spec.

Your eval set is a collection of inputs (user messages, conversation histories, document uploads, tool results) paired with assertions about what the agent should do. Some assertions are deterministic ("should call tool X"); some are graded by a judge model ("response should cite invoice number"); some are graded by humans on a sample.
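The deterministic slice of an eval set is plain data plus a checker. A sketch of that shape, assuming hypothetical `AgentRun` and `Assertion` types; judge-model and human-graded assertions are left out because they need an external grader:

```typescript
// What a single agent run produced, as captured from a trace.
interface AgentRun {
  toolsCalled: string[];
  response: string;
}

type Assertion =
  | { kind: "calls_tool"; tool: string }
  | { kind: "never_calls_tool"; tool: string }
  | { kind: "response_matches"; pattern: RegExp };

interface EvalCase {
  id: string;
  input: string;
  assertions: Assertion[];
}

// Returns a description of each failed assertion; empty means the case passed.
function failures(run: AgentRun, assertions: Assertion[]): string[] {
  const failed: string[] = [];
  for (const a of assertions) {
    switch (a.kind) {
      case "calls_tool":
        if (run.toolsCalled.indexOf(a.tool) === -1) failed.push(`expected call to ${a.tool}`);
        break;
      case "never_calls_tool":
        if (run.toolsCalled.indexOf(a.tool) !== -1) failed.push(`unexpected call to ${a.tool}`);
        break;
      case "response_matches":
        if (!a.pattern.test(run.response)) failed.push(`response does not match ${a.pattern}`);
        break;
    }
  }
  return failed;
}

const citesInvoice: EvalCase = {
  id: "golden-001",
  input: "Chase the overdue invoice for Globex.",
  assertions: [
    { kind: "calls_tool", tool: "get_invoice_by_id" },
    { kind: "never_calls_tool", tool: "email.send" },
    { kind: "response_matches", pattern: /INV-\d+/ },
  ],
};
```

Keeping cases as data means the same file serves as test suite and as written spec.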

What goes in the eval set

The golden path: 20–40 scenarios that represent the agent's most common, expected interactions. These are the things the agent has to be excellent at, every time.
The edge cases: 30–60 scenarios for things the agent will see less often but must handle correctly — refusals, ambiguous requests, tool failures, retries, partial information.
The adversarial set: 20–40 scenarios designed to make the agent fail — prompt injections, attempts to extract the system prompt, abuse patterns, tenant-boundary probes.
The regression set: every bug you ever fixed becomes a permanent eval. This is how you make sure you don't reintroduce the same problem in three months.

How the iteration loop runs

┌──────────────────────────────────────────────────────────────────────┐
│                          THE ITERATION LOOP                          │
│                                                                      │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐          │
│   │ Production  │ ───► │    Trace    │ ───► │  Identify   │          │
│   │   traffic   │      │ inspection  │      │  failures   │          │
│   └─────────────┘      └─────────────┘      └─────────────┘          │
│          ▲                                         │                 │
│          │                                         ▼                 │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐          │
│   │   Deploy    │ ◄─── │  Run evals  │ ◄─── │ Add eval +  │          │
│   │   to prod   │      │ on candidate│      │ fix prompt  │          │
│   └─────────────┘      └─────────────┘      └─────────────┘          │
│                               │                                      │
│                               ▼ (if regression)                      │
│                        ┌─────────────┐                               │
│                        │  Reject &   │                               │
│                        │   iterate   │                               │
│                        └─────────────┘                               │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Every prompt change runs the full eval set. Every new failure mode found in production becomes a permanent eval. The eval set grows monotonically; it never shrinks. After six months it's a comprehensive description of what your agent is supposed to do, and onboarding a new engineer to the agent means showing them the evals before the prompts.
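The "reject on regression" branch of the loop, together with the rule that the eval set never shrinks, can be expressed as a gate in CI. A minimal sketch, assuming eval results are available as a map from eval id to pass/fail; the function names are hypothetical:

```typescript
type EvalResults = Map<string, boolean>; // eval id -> passed

// An eval regresses if it passed on the baseline and fails on the candidate.
// An eval missing from the candidate run counts too: the set must not shrink.
function regressions(baseline: EvalResults, candidate: EvalResults): string[] {
  const broken: string[] = [];
  baseline.forEach((passed, id) => {
    if (!candidate.has(id)) {
      broken.push(`${id} (removed)`);
    } else if (passed && candidate.get(id) === false) {
      broken.push(id);
    }
  });
  return broken;
}

function canPromote(baseline: EvalResults, candidate: EvalResults): boolean {
  return regressions(baseline, candidate).length === 0;
}

const currentProd: EvalResults = new Map([["golden-001", true], ["edge-014", true], ["adv-003", false]]);
const candidatePrompt: EvalResults = new Map([["golden-001", true], ["edge-014", false], ["adv-003", true], ["reg-101", true]]);
// regressions(currentProd, candidatePrompt) -> ["edge-014"]
```

The candidate here fixes one failure and adds a new regression eval, yet still cannot ship because it breaks a case that production currently passes.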

The teams that have working production agents in 2026 are the ones that built evals first and prompts second. The teams that have AI features that "work in demo but break in production" did it the other way around.

06 — How Managed Agent builds fail

Six failure modes I've seen in real builds. Avoid them.

The platform handles most of the easy mistakes for you. The remaining mistakes are higher-order — architecture, scope, discipline — and they're the ones that turn a six-week build into a six-month rebuild.

Failure 01

Reaching for Managed Agents when you needed the API

You have a single-shot classification or extraction job running at scale. Wrapping it in an agent runtime adds latency, cost, and operational complexity for no benefit.

The decision matrix at the top of this page exists to catch this. If your build is one model call, it's an API call — not an agent.

Failure 02

Shipping without an eval set

Six weeks in, "the prompt feels good." Three months in, a prompt tweak breaks something nobody noticed because there was no spec.

Evals first. Prompts second. The eval set is the agent's spec; treat it like the test suite of your most important code.

Failure 03

Over-permissioned tools

Every tool the agent has access to is a permission boundary. Giving the agent write access to your CRM, your Calendar, your finance system, and your email — without explicit approval gates per category — is how the embarrassing screenshots happen.

Default deny. Approve writes per-category. Outbound communication stays human-sent. The approval routing pattern in the Tier-3 build is the production-grade version of this.

Failure 04

Memory scope confusion

Putting org-level facts in session memory: agent forgets your business between sessions. Putting session state in user memory: agent confuses one conversation with another. Putting user data in org memory: tenant boundary violation.

Memory scope decisions are not implementation details. They are product decisions. Make them on day one and write them down.

Failure 05

Skipping the cost cap

An agent in a tool-call retry loop, or a customer abusing your free agent as a general-purpose LLM, can run up four-figure bills in a day. Most teams discover this from a billing alert, not from a cost cap.

Per-session caps, per-user-per-day caps, p99 monitoring. Set them before launch. The platform supports them; teams skip configuring them.

Failure 06

Vendor concentration without a back door

Three years from now, the pricing changes, or the platform deprecates a feature you depend on, or a competitor offers a 5x better product. If your entire agent architecture is built around Managed Agents specifics, your switching cost is months of work.

Keep the system prompt and eval set portable. Keep your tool definitions in your repo, not in the platform UI. Keep the integration plane (how your product calls the agent) abstracted behind a thin interface. The day you need to switch, the agent runtime is the only piece that should be Anthropic-specific.
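The "thin interface" can be very thin. The sketch below is one possible shape, not a prescribed one: `AgentRuntime`, `EchoRuntime`, and `answer` are hypothetical names, and the in-memory implementation stands in for a vendor adapter that would wrap whichever SDK you use.

```typescript
// Hypothetical seam between product code and whichever runtime hosts the agent.
interface AgentRuntime {
  startSession(userId: string): Promise<string>;             // resolves to a session id
  send(sessionId: string, message: string): Promise<string>; // resolves to the reply
}

// Stand-in for tests; a vendor adapter would implement the same interface.
class EchoRuntime implements AgentRuntime {
  calls: string[] = [];

  async startSession(userId: string): Promise<string> {
    this.calls.push(`startSession:${userId}`);
    return `session-${userId}`;
  }

  async send(sessionId: string, message: string): Promise<string> {
    this.calls.push(`send:${sessionId}`);
    return `echo: ${message}`;
  }
}

// Product code depends only on the interface, never on a vendor SDK.
async function answer(runtime: AgentRuntime, userId: string, message: string): Promise<string> {
  const sessionId = await runtime.startSession(userId);
  return runtime.send(sessionId, message);
}
```

If switching day comes, the work is writing one new class that implements `AgentRuntime`; callers such as `answer` do not change.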

The teams that ship working agents in 2026 aren't the ones with the best frameworks. They're the ones who chose the right layer of abstraction and built the discipline around it.

If you're evaluating whether Managed Agents is the right call for a specific build on your roadmap — that's the conversation worth having before the architecture gets locked in. Free 30 minutes, direct technical read, no pitch.
