What We Can Learn from Anthropic to Build Our Own Agent
A deep dive into six Anthropic engineering blog posts, distilling every actionable lesson into a practical guide for building production-grade AI agents.
Table of Contents
- Introduction
- The Six Pillars
- Lesson 1: Start Simple
- Lesson 2: Master the Five Workflow Patterns
- Lesson 3: Design Your Agent-Computer Interface
- Lesson 4: Context Engineering
- Lesson 5: Build Effective Tools
- Lesson 6: Harness Long-Running Agents
- Lesson 7: Scale with Multi-Agent Architectures
- Lesson 8: Evaluation-Driven Development
- The Agent Builder’s Checklist
- Conclusion
Introduction
Anthropic has published a series of deeply practical engineering blog posts that, taken together, form arguably the most comprehensive public guide to building AI agents today. These aren’t theoretical musings — they are battle-tested lessons from the teams that built Claude Code and the Claude Research feature, and from work with dozens of enterprise customers deploying agents in production.
This blog synthesizes all six articles into a single, actionable guide for building your own agent.
The Six Pillars
| # | Article | Core Theme | Key Takeaway |
|---|---|---|---|
| 1 | Building Effective AI Agents | Architecture patterns | Start simple; use composable patterns instead of complex frameworks |
| 2 | Effective Context Engineering | Context management | Context is a finite resource — engineer it like a precious budget |
| 3 | Writing Effective Tools | Tool design | Tools are contracts between deterministic systems and non-deterministic agents |
| 4 | Effective Harnesses for Long-Running Agents | Multi-session persistence | Use initializer + coding agent pattern with structured state handoff |
| 5 | How We Built Our Multi-Agent Research System | Multi-agent coordination | Multi-agent systems excel at parallelizable, high-value tasks |
| 6 | Demystifying Evals for AI Agents | Evaluation methodology | Eval-driven development is non-negotiable for production agents |
Lesson 1: Start Simple — The Complexity Spectrum
Anthropic’s single most repeated piece of advice across all six articles:
“Find the simplest solution possible, and only increase complexity when needed.”
This isn’t just philosophy — it’s a practical heuristic. The most successful agent implementations they’ve seen don’t use complex frameworks. They use simple, composable patterns built directly on LLM APIs.
The Complexity Decision Framework
- Can a single optimized LLM call with retrieval and in-context examples solve this? If yes, stop here.
- Is the task well-defined with predictable subtasks? Use a workflow (predefined code paths).
- Does the task require flexibility and model-driven decisions? Use an agent (LLM controls the flow).
Agents vs Workflows — A Critical Distinction
- Workflows: LLMs and tools orchestrated through predefined code paths. The developer controls the flow.
- Agents: LLMs dynamically direct their own processes and tool usage. The model controls the flow.
Both are “agentic systems,” but knowing which you need prevents over-engineering.
Lesson 2: Master the Five Workflow Patterns
Anthropic identifies five foundational workflow patterns that cover the vast majority of production use cases. These are building blocks you can combine and customize.
1. Prompt Chaining
Sequential steps where each LLM call processes the output of the previous one. Add programmatic “gates” between steps for quality checks.
```
Input → [LLM Step 1] → Gate (validate) → [LLM Step 2] → Gate → [LLM Step 3] → Output
```
When to use: Tasks decomposable into fixed sequential steps. Example: Generate marketing copy → check tone → translate to another language.
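As a sketch of this pattern (the `call_llm` client, the tone gate, and the pipeline steps are all illustrative, not a prescribed implementation):

```python
# Prompt chaining sketch: each step consumes the previous step's output,
# with a cheap programmatic gate in between. `call_llm` is a hypothetical
# stand-in for whatever LLM client you use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM API here")

def tone_gate(text: str) -> bool:
    """Deterministic quality check between steps: reject empty or all-caps copy."""
    return bool(text.strip()) and not text.isupper()

def marketing_pipeline(brief: str) -> str:
    copy = call_llm(f"Write marketing copy for: {brief}")
    if not tone_gate(copy):
        raise ValueError("Gate failed: copy did not pass the tone check")
    return call_llm(f"Translate this copy to French, preserving tone:\n{copy}")
```

Because the gate is ordinary code, failures are caught deterministically before the next (more expensive) LLM step runs.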
2. Routing
Classify the input, then direct it to a specialized handler. This lets you route easy queries to fast/cheap models and hard queries to powerful ones.
```
Input → [Classifier LLM] → Route A: Handler for refunds
                         → Route B: Handler for technical issues
                         → Route C: Handler for general inquiries
```
When to use: Distinct input categories needing different handling. Example: Customer support triage, model selection based on complexity.
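A minimal routing sketch — the keyword classifier below stands in for a cheap classifier LLM call, and the handler names and categories are illustrative:

```python
# Routing sketch: a classifier picks a handler; in production, easy
# categories can be served by a faster/cheaper model.

def classify(query: str) -> str:
    """Stand-in for a small classifier LLM call."""
    q = query.lower()
    if "refund" in q:
        return "refunds"
    if "error" in q or "crash" in q:
        return "technical"
    return "general"

HANDLERS = {
    "refunds": lambda q: f"[refund handler] {q}",
    "technical": lambda q: f"[technical handler] {q}",
    "general": lambda q: f"[general handler] {q}",
}

def route(query: str) -> str:
    return HANDLERS[classify(query)](query)
```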
3. Parallelization
Two variations:
- Sectioning: Break task into independent subtasks, run in parallel, merge results.
- Voting: Run same task multiple times for diverse perspectives, aggregate.
When to use: Independent subtasks or need for multiple perspectives. Example: Code review where one LLM writes code and another screens it for vulnerabilities. Content moderation with multiple guardrail checks.
4. Orchestrator-Workers
A central LLM dynamically breaks down tasks, delegates to worker LLMs, and synthesizes results. Unlike parallelization, subtasks are NOT predefined — they’re determined at runtime by the orchestrator.
When to use: Complex tasks where subtasks aren’t predictable upfront. Example: Multi-file code changes, multi-source research tasks.
5. Evaluator-Optimizer
One LLM generates a response → another evaluates and provides feedback → the generator refines. This loop continues until quality criteria are met.
```
[Generator LLM] → output → [Evaluator LLM] → feedback → [Generator LLM] → improved output → ...
```
When to use: Tasks with clear evaluation criteria where iterative refinement adds measurable value. Example: Literary translation, complex search refinement, code optimization.
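The loop itself is simple enough to sketch generically; `generate` and `evaluate` below are placeholders for the two LLM calls, and the retry budget is an assumption worth tuning:

```python
# Evaluator-optimizer sketch: the generator refines its output using the
# evaluator's feedback until the evaluator accepts it or the budget runs out.

def refine(task: str, generate, evaluate, max_rounds: int = 5) -> str:
    """generate(task, feedback) -> str; evaluate(output) -> (passed, feedback)."""
    feedback = None
    output = ""
    for _ in range(max_rounds):
        output = generate(task, feedback)
        passed, feedback = evaluate(output)
        if passed:
            return output
    return output  # best effort after the budget is exhausted
```

Passing the two LLM calls as parameters keeps the loop testable with fakes before any real model is wired in.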
Pattern Selection Guide
| Pattern | Best For | Example |
|---|---|---|
| Prompt Chaining | Fixed sequential steps | Generate → translate → format |
| Routing | Distinct input categories | Support triage |
| Parallelization | Independent subtasks / voting | Guardrails, code review |
| Orchestrator-Workers | Unpredictable subtasks | Multi-file refactoring |
| Evaluator-Optimizer | Clear quality criteria + iterative refinement | Translation, search |
The Autonomous Agent Loop
Beyond workflows, a true autonomous agent is simply an LLM using tools based on environmental feedback in a loop:
```python
while not done:
    action = llm.decide(context, tools)
    result = execute(action)
    context.update(result)
    if llm.should_stop(context):
        done = True
```
Three core principles for autonomous agents:
- Maintain simplicity in the agent loop design
- Prioritize transparency — show the agent’s planning steps to the user
- Carefully craft the ACI with thorough tool documentation and testing
Lesson 3: Design Your Agent-Computer Interface (ACI)
One of Anthropic’s most original insights is the Agent-Computer Interface (ACI) — the analog of HCI (Human-Computer Interaction), but for AI agents. Anthropic reports spending more time optimizing tools than the overall prompt when building their SWE-bench agent.
ACI Design Principles
Put yourself in the model’s shoes. If tool usage isn’t obvious from its description and parameters, it won’t be obvious to the model. Good tool definitions include example usage, edge cases, input format requirements, and clear boundaries.
Think like writing a great docstring for a junior developer. Especially important when using many similar tools — parameter names and descriptions must disambiguate.
Test empirically. Run many example inputs to see what mistakes the model makes, then iterate.
Poka-yoke your tools (error-proof them). Change arguments so mistakes become harder. Example: Anthropic changed their file edit tool to require absolute file paths after seeing errors with relative paths.
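A poka-yoke sketch in that spirit — the function name, schema, and error wording are illustrative, not Anthropic’s actual implementation:

```python
# Error-proofed edit tool: reject relative paths at the boundary so the
# model gets an actionable error instead of silently editing the wrong file.
import os

def edit_file(path: str, new_content: str) -> str:
    if not os.path.isabs(path):
        # Actionable message the agent can correct, not an opaque traceback.
        return (f"Error: '{path}' is a relative path. "
                f"Pass an absolute path, e.g. /repo/src/main.py")
    with open(path, "w") as f:
        f.write(new_content)
    return f"Wrote {len(new_content)} characters to {path}"
```

Returning the error as a string (rather than raising) keeps it in-band, where the agent can read it and retry with a corrected call.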
Choose formats the model can write easily:
- Don’t require diffs (the model must predict chunk headers before writing code)
- Don’t require JSON-escaped code (extra escaping of newlines and quotes)
- Keep formats close to what appears naturally in training data
Tool Format Decision Guide
| Format Choice | Why It Matters |
|---|---|
| Absolute paths > relative paths | Eliminates state-dependent errors |
| Markdown code blocks > JSON-wrapped code | No escaping overhead |
| Full file rewrites > diffs | No need to count lines in advance |
| Provide “thinking” tokens before committal output | Prevents painting into corners |
Lesson 4: Context Engineering — The New Frontier
Anthropic argues that we’re moving beyond “prompt engineering” to context engineering — the art of curating the optimal set of tokens at every inference step.
Why Context Is a Finite Resource
LLMs have limited working memory. Anthropic identifies a phenomenon called context rot: as token count increases, the model’s ability to accurately recall information decreases. This stems from the transformer’s n² pairwise attention relationships — as context grows, attention gets stretched thin.
System Prompts: Find the Right Altitude
Anthropic identifies a “Goldilocks zone” for system prompt specificity:
| Failure Mode | Problem | Solution |
|---|---|---|
| Too specific (hardcoded if-else logic) | Brittle, high maintenance | Provide heuristics, not hard rules |
| Too vague (high-level hand-waving) | Model assumes shared context that doesn’t exist | Be specific enough to guide behavior |
| Right altitude | Specific enough to guide, flexible enough for heuristics | Start minimal, add based on failure modes |
Best practices:
- Organize into distinct sections (`<background_information>`, `<instructions>`, `## Tool guidance`)
- Use XML tagging or Markdown headers for delineation
- Strive for the minimal set of information that fully outlines expected behavior
- Start with the best model + minimal prompt, then add instructions based on observed failures
Context Retrieval: Just-in-Time vs. Pre-Computed
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Pre-Computed (Classic RAG) | Embed data upfront, retrieve by similarity at query time | Fast, predictable cost | Stale indexes, irrelevant context floods attention |
| Just-in-Time (Agentic) | Maintain lightweight references; agent dynamically loads data using tools | Always fresh, only relevant context enters window | Slower, requires well-designed tools |
Anthropic’s Claude Code uses a hybrid approach: CLAUDE.md files are dropped into context upfront, while glob and grep tools allow just-in-time navigation — bypassing stale indexing issues.
Three Techniques for Long-Horizon Context
| Technique | How It Works | Best For |
|---|---|---|
| Compaction | Summarize conversation near context limits, reinitialize with the summary | Extensive back-and-forth tasks |
| Structured Note-Taking | Agent writes persistent notes outside context (e.g., NOTES.md, to-do lists) | Iterative development with milestones |
| Sub-Agent Architectures | Specialized sub-agents handle focused tasks with clean context windows, return condensed summaries | Complex research, parallel exploration |
Compaction example from Claude Code: The model summarizes message history, preserving architectural decisions, unresolved bugs, and implementation details while discarding redundant tool outputs. It continues with compressed context plus the five most recently accessed files.
Structured note-taking example from Claude playing Pokémon: The agent maintains precise tallies across thousands of game steps (“for the last 1,234 steps I’ve been training my Pokémon in Route 1, Pikachu has gained 8 levels toward the target of 10”), develops maps of explored regions, and remembers combat strategies — all persisted outside the context window.
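The compaction technique can be sketched as follows; the word-count token proxy, the thresholds, and the `summarize` callback are stand-ins for a real tokenizer and a real summarization call:

```python
# Compaction sketch: when the transcript nears the context limit, summarize
# the older messages and keep only the summary plus the most recent turns.

def count_tokens(messages) -> int:
    # Crude proxy; use your model's tokenizer in practice.
    return sum(len(m["content"].split()) for m in messages)

def compact(messages, summarize, limit: int = 100, keep_recent: int = 4):
    if count_tokens(messages) <= limit:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # e.g. an LLM call preserving decisions and open bugs
    return [{"role": "system",
             "content": f"Summary of earlier work: {summary}"}] + recent
```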
Lesson 5: Build Effective Tool Ecosystems
Anthropic’s tool design article introduces a paradigm shift: tools are a new kind of software — a contract between deterministic systems and non-deterministic agents. When a user asks “Should I bring an umbrella today?”, an agent might call the weather tool, answer from general knowledge, or ask a clarifying question. This non-determinism requires fundamentally rethinking how we write software.
Five Principles for Effective Tools
Principle 1: Choose the Right Tools to Implement
More tools ≠ better outcomes. Don’t just wrap every API. Build few, thoughtful tools for high-impact workflows:
| Instead of… | Build… |
|---|---|
| `list_users` + `list_events` + `create_event` | `schedule_event` (handles availability + creation) |
| `read_logs` (returns everything) | `search_logs` (returns relevant lines + context) |
| `get_customer` + `list_transactions` + `list_notes` | `get_customer_context` (compiles all relevant info) |
Principle 2: Namespace Your Tools
Group related tools under common prefixes to delineate boundaries:
- By service: `asana_search`, `jira_search`
- By resource: `asana_projects_search`, `asana_users_search`
Prefix- vs suffix-based naming has non-trivial effects on evaluations — test this choice.
Principle 3: Return Meaningful Context
Prioritize contextual relevance over flexibility. Avoid low-level technical identifiers:
- Use `name`, `image_url`, `file_type` — not `uuid`, `256px_image_url`, `mime_type`
- Resolving UUIDs to natural-language names significantly reduces hallucinations
- Expose a `response_format` enum (`"concise"` vs `"detailed"`) to let agents control verbosity
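One way such a tool might look, assuming the field names and `response_format` values described above (an illustrative sketch, not a prescribed schema):

```python
# A tool returning human-readable fields, with agent-controlled verbosity.
from enum import Enum

class ResponseFormat(str, Enum):
    CONCISE = "concise"
    DETAILED = "detailed"

def get_customer_context(customer: dict,
                         response_format: ResponseFormat = ResponseFormat.CONCISE) -> dict:
    # Natural-language names, not UUIDs or MIME types.
    concise = {"name": customer["name"], "file_type": customer["file_type"]}
    if response_format == ResponseFormat.CONCISE:
        return concise
    return {**concise,
            "notes": customer.get("notes", []),
            "image_url": customer.get("image_url")}
```

Defaulting to `CONCISE` keeps the common case token-cheap while leaving the detailed view one parameter away.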
Principle 4: Optimize for Token Efficiency
- Implement pagination, filtering, truncation with sensible defaults
- Claude Code: 25,000 token default limit per tool response
- Steer agents toward small, targeted searches instead of single broad ones
- Prompt-engineer error responses to communicate specific, actionable improvements (not opaque tracebacks)
Principle 5: Prompt-Engineer Your Tool Descriptions
Tool descriptions are loaded into the agent’s context — they collectively steer behavior. Think of describing your tool to a new hire:
- Make implicit context explicit (specialized query formats, niche terminology, resource relationships)
- Use unambiguous parameter names:
user_idnotuser - Impact: Claude Sonnet 3.5 achieved SWE-bench SOTA after precise tool description refinements
- Impact: A tool-testing agent that rewrote descriptions achieved 40% decrease in task completion time
The Evaluation-Driven Tool Improvement Loop
Anthropic’s most powerful technique: use agents to improve the tools you give them.
- Build a prototype of your tools (use Claude Code to one-shot them)
- Create evaluation tasks grounded in real-world uses — strong tasks require multiple tool calls
- Run the evaluation with simple agentic loops (while-loops wrapping alternating LLM + tool calls)
- Analyze results — read transcripts, observe confusion, track tool-calling metrics
- Let Claude analyze transcripts and refactor tools — paste eval transcripts into Claude Code
- Repeat until strong performance on held-out test sets
This process yielded improvements beyond expert human-written tools in Anthropic’s Slack and Asana evaluations.
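Step 3’s “simple agentic loop” can be sketched generically; `decide` and the tool functions below are fakes standing in for real LLM and API calls:

```python
# Eval-harness sketch: a while-style loop alternating LLM decisions and tool
# executions, recording a transcript for later analysis (step 4 above).

def run_eval_task(decide, tools, max_turns: int = 10):
    transcript = []
    for _ in range(max_turns):
        action = decide(transcript)          # LLM picks a tool call or stops
        if action["tool"] == "stop":
            break
        result = tools[action["tool"]](**action["args"])
        transcript.append({"action": action, "result": result})
    return transcript  # feed this back to Claude to find tool confusion
```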
Lesson 6: Harness Long-Running Agents
For tasks spanning hours or days, agents face the “shift change” problem: each new context window starts with no memory of what came before. Anthropic developed a two-part solution.
The Initializer + Coding Agent Pattern
An initializer agent runs once to set up the environment and write a structured description of the work to disk (a feature list, progress notes); each subsequent coding-agent session reads that state, verifies it with a quick test, and continues where the previous session stopped.
Key Insights for Long-Running Agents
- Use JSON for feature lists — models are less likely to inappropriately modify JSON compared to Markdown
- Git is your state management system — agents can revert bad changes and recover working states
- Always test before implementing — start each session with a basic end-to-end test to catch broken state from the previous session
- Explicitly require end-to-end testing — without prompting, agents tend to declare features complete after writing code without verifying them as a real user would
- Long-horizon conversation management — agents should summarize completed work phases and store essential information in external memory before proceeding to new tasks
- Subagent output to filesystem — subagents can write outputs directly to external systems (files, databases), then pass lightweight references back to the coordinator, preventing information loss during multi-stage processing
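A minimal sketch of the structured JSON state handoff described above — the `agent_state.json` file name and its schema are hypothetical:

```python
# JSON state handoff for long-running agents: models are less likely to
# inappropriately modify JSON than Markdown, and each session can check
# prior state before starting new work.
import json
import pathlib

STATE_FILE = pathlib.Path("agent_state.json")

def save_state(features: list) -> None:
    STATE_FILE.write_text(json.dumps({"features": features}, indent=2))

def load_state() -> list:
    if not STATE_FILE.exists():
        return []
    return json.loads(STATE_FILE.read_text())["features"]

def next_incomplete_feature():
    # Session start: inspect prior state before implementing anything new.
    return next((f for f in load_state() if not f["done"]), None)
```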
Lesson 7: Scale with Multi-Agent Architectures
Anthropic’s Research feature uses a multi-agent orchestrator-worker system where a lead agent coordinates while specialized subagents operate in parallel. The results: multi-agent Claude Opus 4 + Sonnet 4 subagents outperformed single-agent Opus 4 by 90.2% on their internal research eval.
Why Multi-Agent Works
In Anthropic’s analysis, three factors explained 95% of performance variance on BrowseComp:
- Token usage (80% of variance by itself)
- Number of tool calls
- Model choice
Multi-agent architectures scale token usage by distributing work across agents with separate context windows. Each subagent explores extensively but returns only a condensed summary.
The Token Economics
| Interaction Type | Relative Token Usage |
|---|---|
| Chat | 1× (baseline) |
| Single Agent | ~4× |
| Multi-Agent | ~15× |
Multi-agent systems burn through tokens fast. They require high-value tasks where performance gains justify the cost. Best fit: tasks involving heavy parallelization, information exceeding single context windows, and interfacing with numerous complex tools.
The Architecture: Orchestrator-Worker
```
User Query
    ↓
[Lead Researcher Agent] ← Extended thinking for planning
    ↓ saves plan to Memory
    ├── [Subagent 1: "AI startups 2025"]        ← interleaved thinking after each tool result
    ├── [Subagent 2: "Enterprise AI adoption"]  ← parallel web search
    └── [Subagent 3: "AI regulation landscape"] ← independent context window
    ↓ condensed findings returned
[Lead Researcher Agent] ← synthesizes, decides if more research needed
    ↓ if yes → spawn more subagents
    ↓ if done →
[Citation Agent] ← processes documents, attributes sources
    ↓
Final Research Report with Citations
```
Seven Principles for Multi-Agent Prompting
1. Think like your agents. Build simulations with the exact prompts and tools from your system. Watch agents work step-by-step to reveal failure modes (agents continuing when they already have results, verbose search queries, wrong tool selection).
2. Teach the orchestrator how to delegate. Each subagent needs: an objective, an output format, guidance on tools/sources, and clear task boundaries. Without detail, agents duplicate work or leave gaps. Example failure: one subagent explored the 2021 automotive chip crisis while two others duplicated work on 2025 supply chains.
3. Scale effort to query complexity. Embed scaling rules in prompts:
- Simple fact-finding: 1 agent, 3-10 tool calls
- Direct comparisons: 2-4 subagents, 10-15 calls each
- Complex research: 10+ subagents with clearly divided responsibilities
4. Tool design and selection are critical. An agent searching the web for context that only exists in Slack is doomed. Give agents explicit heuristics: examine all available tools first, match tool usage to user intent, prefer specialized tools over generic ones.
5. Let agents improve themselves. Claude 4 models are excellent prompt engineers. A tool-testing agent that tested an MCP tool dozens of times and rewrote its description achieved a 40% decrease in task completion time for future agents.
6. Start wide, then narrow down. Mirror expert human research: explore the landscape before drilling into specifics. Agents default to overly long, specific queries that return few results. Prompt agents to start with short, broad queries, then progressively narrow.
7. Guide the thinking process + parallelize. Extended thinking serves as a controllable scratchpad. Interleaved thinking after tool results helps agents evaluate quality and identify gaps. Parallel tool calling transforms speed:
- Lead agent spins up 3-5 subagents in parallel (not serially)
- Subagents use 3+ tools in parallel
- Result: up to 90% reduction in research time for complex queries
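The parallel fan-out can be sketched with `asyncio`; `research` below is a stand-in for a full subagent run (searches plus interleaved thinking), and the topics are illustrative:

```python
# Parallel subagent fan-out: the lead agent spawns subagents concurrently
# (not serially), and each returns only a condensed summary.
import asyncio

async def research(topic: str) -> str:
    await asyncio.sleep(0)  # real work: web searches, tool calls, thinking
    return f"summary of {topic}"

async def fan_out(topics):
    # gather() launches all subagent coroutines at once.
    return await asyncio.gather(*(research(t) for t in topics))

summaries = asyncio.run(fan_out(["AI startups", "AI regulation"]))
```

The same structure applies within a subagent: issuing 3+ tool calls concurrently instead of one at a time.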
Production Reliability Challenges
| Challenge | Solution |
|---|---|
| Agents are stateful; errors compound | Build resume-from-checkpoint systems; use retry logic + regular checkpoints; let agents know when tools fail and adapt |
| Non-deterministic debugging | Add full production tracing; monitor agent decision patterns without reading conversation content |
| Deployment disrupts running agents | Use rainbow deployments — gradually shift traffic old → new while both run simultaneously |
| Synchronous execution bottlenecks | Consider async execution for more parallelism, but manage coordination and state consistency carefully |
Lesson 8: Evaluation-Driven Development
Anthropic’s evaluation article is perhaps the most practically important for anyone building production agents. Their core message: evals are non-negotiable, and their value compounds over the entire agent lifecycle.
Why Evals Matter
“Writing evals is useful at any stage. Early on, evals force product teams to specify what success means. Later, they uphold a consistent quality bar.”
Teams without evals face weeks of testing when new models come out; teams with evals can upgrade in days. Evals also become the highest-bandwidth communication channel between product and research teams, defining metrics researchers can optimize against.
Real-world examples:
- Descript built evals around three dimensions: don’t break things, do what I asked, do it well. They evolved from manual grading to LLM judges with periodic human calibration.
- Bolt built an eval system in 3 months that runs their agent and grades outputs with static analysis, browser agents, and LLM judges.
- Claude Code started with fast iteration from user feedback, then added evals — first for narrow areas like code generation, then expanding into broader behavioral evals.
The Structure of an Agent Evaluation
An evaluation consists of:
- Task: An input + environment for the agent (e.g., “fix this auth bypass vulnerability”)
- Graders: Logic that scores some aspect of the agent’s performance. A task can have multiple graders with multiple assertions.
- Transcript/Trace: The full record of the agent’s actions (tool calls, reasoning, outputs)
- Metrics: Quantitative measurements (turns taken, tokens used, latency)
- Eval Suite: A collection of tasks designed to measure specific capabilities
Three Types of Graders
| Grader Type | Methods | Strengths | Weaknesses |
|---|---|---|---|
| Code-Based | String match, binary tests, static analysis, tool call verification, transcript analysis | Fast, cheap, objective, reproducible | Brittle to valid variations, lacking nuance |
| Model-Based | LLM-as-judge with rubrics, pairwise comparison, multi-dimension scoring | Flexible, handles nuance, scales well | Requires calibration, non-deterministic |
| Human | Expert review, user testing, adversarial testing | Catches what automation misses, finds edge cases | Expensive, slow, doesn’t scale |
Best practice: Use deterministic graders where possible, LLM graders where necessary, and human graders for validation. Don’t over-specify the agent’s path — grade what the agent produced, not the path it took.
Capability vs. Regression Evals
Keep two suites with different jobs: capability evals are hard tasks the agent mostly fails, giving you a hill to climb; regression evals are tasks the agent reliably passes, run continuously to catch drift. As performance improves, graduate capability tasks into the regression suite.
Evaluating Different Agent Types
Coding Agents: Rely on well-specified tasks, stable test environments, and thorough tests. Deterministic graders are natural — does the code run and do the tests pass? SWE-bench Verified and Terminal-Bench follow this approach. LLMs progressed from 40% to >80% on SWE-bench in just one year.
Conversational Agents: Require a second LLM to simulate the user. Success is multidimensional: is the ticket resolved (state check), did it finish in <10 turns (transcript constraint), was the tone appropriate (LLM rubric)?
Research Agents: Combine groundedness checks (claims supported by sources), coverage checks (key facts included), and source quality checks (authoritative vs. first-retrieved). LLM rubrics should be frequently calibrated against expert human judgment.
Computer Use Agents: Require running the agent in a real or sandboxed environment and checking whether it achieved the intended outcome. Balance DOM-based interactions (fast, token-heavy) with screenshot-based interactions (slower, token-efficient).
Handling Non-Determinism
Agent behavior varies between runs. Two metrics help:
- pass@k: Probability of at least one success in k attempts. As k rises, score rises. Useful when one success is sufficient.
- pass^k: Probability of ALL k trials succeeding. As k rises, score falls. Useful for customer-facing agents where consistency matters.
At k=1 they’re identical. At k=10, for an agent with middling per-trial reliability, they tell opposite stories: pass@k approaches 100% while pass^k approaches 0%.
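Assuming independent trials with per-trial success probability p, both metrics have simple closed forms:

```python
# pass@k rises with k (one success suffices); pass^k falls with k
# (every trial must succeed); they coincide at k = 1.

def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k  # P(at least one of k trials succeeds)

def pass_pow_k(p: float, k: int) -> float:
    return p ** k            # P(all k trials succeed)
```

At p = 0.5, pass@10 ≈ 0.999 while pass^10 ≈ 0.001 — the same score at k = 1 diverges to near-certain success versus near-certain failure by k = 10.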
The Roadmap from Zero to Great Evals
| Step | Action | Detail |
|---|---|---|
| 0 | Start early | 20-50 tasks from real failures is a great start. Don’t wait for hundreds. |
| 1 | Automate manual checks | Convert bug reports and manual QA into test cases. Prioritize by user impact. |
| 2 | Write unambiguous tasks | Two experts should independently reach the same pass/fail verdict. Create reference solutions. |
| 3 | Build balanced problem sets | Test both when behavior SHOULD occur and when it SHOULDN’T. Avoid class-imbalanced evals. |
| 4 | Build a robust eval harness | Each trial from a clean environment. No shared state between runs. |
| 5 | Design graders thoughtfully | Grade what was produced, not the path taken. Build in partial credit for multi-component tasks. |
| 6 | Run and iterate | Track latency, tokens, cost, error rates alongside accuracy. Calibrate LLM judges against humans. |
Critical lesson: Opus 4.5 initially scored 42% on CORE-Bench, but after fixing grading bugs, ambiguous specs, and stochastic tasks, the score jumped to 95%. A 0% pass@100 rate usually signals a broken task, not an incapable agent.
The Agent Builder’s Checklist
Phase 1: Design
- Start with the simplest possible solution (single LLM call + retrieval)
- Only add agent complexity when simpler solutions demonstrably fall short
- Choose the right pattern: workflow (predefined) vs agent (model-directed)
- Define success criteria BEFORE building — write initial eval tasks
Phase 2: Tools & Context
- Design ACI with same care as HCI — invest heavily in tool documentation and testing
- Build few, thoughtful tools (not 1:1 API wrappers); consolidate multi-step operations
- Namespace tools clearly; test prefix vs suffix naming
- Optimize tool responses for token efficiency (pagination, filtering, truncation)
- Engineer context at every inference step — treat tokens as a finite budget
- Use hybrid retrieval: static context files + just-in-time tool-based retrieval
Phase 3: Implementation
- Keep agent loops simple: a `while` loop with LLM decision → tool execution → context update
- Implement compaction for long conversations (summarize near context limits)
- Use structured note-taking for multi-session persistence (progress files, feature lists)
- Use git for state management in long-running coding agents
- Test before implementing each session — catch broken state early
Phase 4: Multi-Agent (Only If Needed)
- Verify the task genuinely benefits from parallelization
- Use orchestrator-worker pattern with clear delegation instructions
- Scale effort to query complexity (embed rules in prompts)
- Enable parallel tool calling for both lead agent and subagents
- Use extended thinking + interleaved thinking for planning and evaluation
- Build resume-from-checkpoint systems for error recovery
Phase 5: Evaluation
- Start with 20-50 tasks from real usage patterns and failures
- Combine code-based, model-based, and human graders
- Maintain separate capability evals (hill to climb) and regression evals (catch drift)
- Run evals in isolated, clean environments — no shared state
- Graduate high-performing capability evals to regression suites
- Track pass@k for tasks where one success matters; pass^k where consistency matters
- Use agents to analyze eval transcripts and improve tools iteratively
Conclusion
The lessons from Anthropic’s six engineering articles converge on a single philosophy: simplicity is not the starting point — it’s the goal. The teams that succeed with agents are not those who build the most complex systems, but those who find the minimum viable complexity for their task and then invest ruthlessly in the details that matter — tool design, context engineering, and evaluation.
Here are the five meta-lessons that emerge from synthesizing all six articles:
Complexity is a cost, not a feature. Every layer you add (multi-agent, long-horizon, async execution) introduces new failure modes. Add layers only when measurement proves they help.
Tools are the new UX. The Agent-Computer Interface deserves the same design rigor as a human-facing product. Anthropic spent more time on tool design than prompt design for their best agents.
Context is the bottleneck. It’s not about how much you can fit in the context window — it’s about the signal-to-noise ratio of every token. Engineer context like you engineer code: deliberately, testably, and with a bias toward deletion.
Evals are the foundation. Without evals, you’re guessing. With evals, you’re engineering. Start early, start small, and let your eval suite compound in value over time.
Let agents improve agents. One of the most striking patterns across Anthropic’s articles is using Claude to optimize the tools, prompts, and descriptions that Claude itself uses. This self-improving loop — build → evaluate → let the agent analyze failures → refine → repeat — is the future of agent development.
The gap between prototype and production is wider than anticipated. But with these lessons from Anthropic as your guide, you have a clear roadmap for navigating that gap.