What We Can Learn from Anthropic to Build Our Own Agent

A deep-dive on Anthropic engineering blog posts, distilling every actionable lesson into a practical guide for building production-grade AI agents.

Table of Contents

  1. Introduction
  2. The Six Pillars
  3. Lesson 1: Start Simple
  4. Lesson 2: Master the Five Workflow Patterns
  5. Lesson 3: Design Your Agent-Computer Interface
  6. Lesson 4: Context Engineering
  7. Lesson 5: Build Effective Tools
  8. Lesson 6: Harness Long-Running Agents
  9. Lesson 7: Scale with Multi-Agent Architectures
  10. Lesson 8: Evaluation-Driven Development
  11. The Agent Builder’s Checklist
  12. Conclusion

Introduction

Anthropic has published a series of deeply practical engineering blog posts that, taken together, form arguably the most comprehensive public guide to building AI agents today. These aren’t theoretical musings — they are battle-tested lessons from the teams that built Claude Code and the Claude Research feature, and that have worked with dozens of enterprise customers deploying agents in production.

This post synthesizes all six articles into a single, actionable guide to building your own agent.


The Six Pillars

| # | Article | Core Theme | Key Takeaway |
|---|---------|------------|--------------|
| 1 | Building Effective AI Agents | Architecture patterns | Start simple; use composable patterns instead of complex frameworks |
| 2 | Effective Context Engineering | Context management | Context is a finite resource — engineer it like a precious budget |
| 3 | Writing Effective Tools | Tool design | Tools are contracts between deterministic systems and non-deterministic agents |
| 4 | Effective Harnesses for Long-Running Agents | Multi-session persistence | Use initializer + coding agent pattern with structured state handoff |
| 5 | How We Built Our Multi-Agent Research System | Multi-agent coordination | Multi-agent systems excel at parallelizable, high-value tasks |
| 6 | Demystifying Evals for AI Agents | Evaluation methodology | Eval-driven development is non-negotiable for production agents |
(Figure: The Six Pillars of Anthropic's Agent Engineering Knowledge, pairing each article with its theme: patterns and architecture, the attention budget, tool design and optimization, multi-session persistence, orchestrator-worker at scale, and measurement and quality.)

Lesson 1: Start Simple — The Complexity Spectrum

Anthropic’s single most repeated piece of advice across all six articles:

“Find the simplest solution possible, and only increase complexity when needed.”

This isn’t just philosophy — it’s a practical heuristic. The most successful agent implementations they’ve seen don’t use complex frameworks. They use simple, composable patterns built directly on LLM APIs.

The Complexity Decision Framework

  1. Can a single optimized LLM call with retrieval and in-context examples solve this? If yes, stop here.
  2. Is the task well-defined with predictable subtasks? Use a workflow (predefined code paths).
  3. Does the task require flexibility and model-driven decisions? Use an agent (LLM controls the flow).
(Figure: The Complexity Spectrum, running from single LLM calls with retrieval and examples, through workflows with predefined code paths, to single agents that direct their own process and multi-agent systems with separate context windows. As complexity increases, so do latency, cost, error compounding, and debugging difficulty, but also flexibility, task scope, autonomy, and parallel capacity. Anthropic's rule: add complexity ONLY when it demonstrably improves outcomes.)

Agents vs Workflows — A Critical Distinction

  • Workflows: LLMs and tools orchestrated through predefined code paths. The developer controls the flow.
  • Agents: LLMs dynamically direct their own processes and tool usage. The model controls the flow.

Both are “agentic systems,” but knowing which you need prevents over-engineering.


Lesson 2: Master the Five Workflow Patterns

Anthropic identifies five foundational workflow patterns that cover the vast majority of production use cases. These are building blocks you can combine and customize.

1. Prompt Chaining

Sequential steps where each LLM call processes the output of the previous one. Add programmatic “gates” between steps for quality checks.

Input → [LLM Step 1] → Gate (validate) → [LLM Step 2] → Gate → [LLM Step 3] → Output

When to use: Tasks decomposable into fixed sequential steps. Example: Generate marketing copy → check tone → translate to another language.
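As a sketch of this pattern, with call_llm as a hypothetical stand-in for a real LLM client, the chain-with-gates flow might look like:

```python
# Prompt-chaining sketch with a programmatic gate between steps.
# call_llm is a hypothetical stand-in for a real LLM API call.
def call_llm(prompt: str) -> str:
    return f"[output for: {prompt}]"  # replace with a real client call

def gate_nonempty(text: str) -> str:
    # Gates validate intermediate output before the next step runs.
    if not text.strip():
        raise ValueError("empty intermediate output; aborting chain")
    return text

def chain(user_input: str, step_prompts: list[str]) -> str:
    result = user_input
    for step in step_prompts:
        result = gate_nonempty(call_llm(f"{step}\n\n{result}"))
    return result
```

Each gate is a cheap deterministic check; swapping gate_nonempty for a tone or schema validator gives the generate → check tone → translate chain from the example above.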

2. Routing

Classify the input, then direct it to a specialized handler. This lets you route easy queries to fast/cheap models and hard queries to powerful ones.

Input → [Classifier LLM] → Route A: Handler for refunds
                          → Route B: Handler for technical issues
                          → Route C: Handler for general inquiries

When to use: Distinct input categories needing different handling. Example: Customer support triage, model selection based on complexity.
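A minimal routing sketch, with a keyword classifier standing in for the classifier LLM (all handler names are hypothetical; in production, routes could also map to different models):

```python
# Routing sketch: classify, then dispatch to a specialized handler.
def classify(query: str) -> str:
    # Stub classifier; in production this would be an LLM call.
    q = query.lower()
    if "refund" in q:
        return "refunds"
    if "error" in q or "crash" in q:
        return "technical"
    return "general"

HANDLERS = {
    "refunds": lambda q: f"[refunds handler] {q}",
    "technical": lambda q: f"[technical handler] {q}",
    "general": lambda q: f"[general handler] {q}",
}

def route(query: str) -> str:
    return HANDLERS[classify(query)](query)
```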

3. Parallelization

Two variations:

  • Sectioning: Break task into independent subtasks, run in parallel, merge results.
  • Voting: Run same task multiple times for diverse perspectives, aggregate.

When to use: Independent subtasks or need for multiple perspectives. Example: Code review where one LLM writes code and another screens it for vulnerabilities. Content moderation with multiple guardrail checks.
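The sectioning variant can be sketched with standard-library concurrency, using a stub in place of the per-section LLM call:

```python
# Sectioning sketch: independent subtasks run in parallel, results merged
# in order. review_section is a stub for a per-section LLM call.
from concurrent.futures import ThreadPoolExecutor

def review_section(section: str) -> str:
    return f"reviewed: {section}"

def parallel_review(sections: list[str]) -> list[str]:
    with ThreadPoolExecutor() as pool:
        # map preserves input order even though calls run concurrently
        return list(pool.map(review_section, sections))
```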

4. Orchestrator-Workers

A central LLM dynamically breaks down tasks, delegates to worker LLMs, and synthesizes results. Unlike parallelization, subtasks are NOT predefined — they’re determined at runtime by the orchestrator.

When to use: Complex tasks where subtasks aren’t predictable upfront. Example: Multi-file code changes, multi-source research tasks.

5. Evaluator-Optimizer

One LLM generates a response → another evaluates and provides feedback → the generator refines. This loop continues until quality criteria are met.

[Generator LLM] → output → [Evaluator LLM] → feedback → [Generator LLM] → improved output → ...

When to use: Tasks with clear evaluation criteria where iterative refinement adds measurable value. Example: Literary translation, complex search refinement, code optimization.
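The loop can be sketched like this, with stub generator and evaluator functions standing in for the two LLMs, and a bounded round count so it always terminates:

```python
# Evaluator-optimizer sketch: stubs stand in for the two LLMs.
def generate(task: str, feedback: str = "") -> str:
    return task + (" (revised)" if feedback else "")

def evaluate(output: str) -> tuple[bool, str]:
    ok = "(revised)" in output            # stub acceptance criterion
    return ok, "" if ok else "needs revision"

def refine(task: str, max_rounds: int = 3) -> str:
    output = generate(task)
    for _ in range(max_rounds):
        ok, feedback = evaluate(output)
        if ok:
            break
        output = generate(task, feedback)
    return output
```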

Pattern Selection Guide

| Pattern | Best For | Example |
|---------|----------|---------|
| Prompt Chaining | Fixed sequential steps | Generate → translate → format |
| Routing | Distinct input categories | Support triage |
| Parallelization | Independent subtasks / voting | Guardrails, code review |
| Orchestrator-Workers | Unpredictable subtasks | Multi-file refactoring |
| Evaluator-Optimizer | Clear quality criteria + iterative refinement | Translation, search |

The Autonomous Agent Loop

Beyond workflows, a true autonomous agent is simply an LLM using tools based on environmental feedback in a loop:

while not done:
    action = llm.decide(context, tools)
    result = execute(action)
    context.update(result)
    if llm.should_stop(context):
        done = True

Three core principles for autonomous agents:

  1. Maintain simplicity in the agent loop design
  2. Prioritize transparency — show the agent’s planning steps to the user
  3. Carefully craft the ACI with thorough tool documentation and testing
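Fleshing out that pseudocode into a runnable sketch (stub_model is a scripted policy standing in for the LLM, so only the control flow is realistic):

```python
# Runnable sketch of the autonomous agent loop.
def stub_model(context: list[str]) -> dict:
    # Hypothetical policy: search once, then stop.
    if any(entry.startswith("result:") for entry in context):
        return {"action": "stop"}
    return {"action": "search", "input": context[0]}

TOOLS = {"search": lambda q: f"result: 3 documents about {q}"}

def run_agent(task: str, max_turns: int = 5) -> list[str]:
    context = [task]
    for _ in range(max_turns):                 # always bound agent loops
        decision = stub_model(context)
        if decision["action"] == "stop":       # the model decides when to stop
            break
        result = TOOLS[decision["action"]](decision["input"])
        context.append(result)                 # environmental feedback re-enters context
    return context
```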

Lesson 3: Design Your Agent-Computer Interface (ACI)

One of Anthropic’s most original insights is the Agent-Computer Interface (ACI) — the analog of HCI (Human-Computer Interface), but for AI agents. Anthropic reports spending more time optimizing tools than the overall prompt when building their SWE-bench agent.

ACI Design Principles

  1. Put yourself in the model’s shoes. If tool usage isn’t obvious from its description and parameters, it won’t be obvious to the model. Good tool definitions include example usage, edge cases, input format requirements, and clear boundaries.

  2. Think like writing a great docstring for a junior developer. Especially important when using many similar tools — parameter names and descriptions must disambiguate.

  3. Test empirically. Run many example inputs to see what mistakes the model makes, then iterate.

  4. Poka-yoke your tools (error-proof them). Change arguments so mistakes become harder. Example: Anthropic changed their file edit tool to require absolute file paths after seeing errors with relative paths.

  5. Choose formats the model can write easily:

    • Don’t require diffs (the model must predict chunk headers before writing code)
    • Don’t require JSON-escaped code (extra escaping of newlines and quotes)
    • Keep formats close to what appears naturally in training data

Tool Format Decision Guide

| Format Choice | Why It Matters |
|---------------|----------------|
| Absolute paths > relative paths | Eliminates state-dependent errors |
| Markdown code blocks > JSON-wrapped code | No escaping overhead |
| Full file rewrites > diffs | No need to count lines in advance |
| Provide "thinking" tokens before committal output | Prevents painting into corners |

Lesson 4: Context Engineering — The New Frontier

Anthropic argues that we’re moving beyond “prompt engineering” to context engineering — the art of curating the optimal set of tokens at every inference step.

Why Context Is a Finite Resource

LLMs have limited working memory. Anthropic identifies a phenomenon called context rot: as token count increases, the model’s ability to accurately recall information decreases. This stems from the transformer’s n² pairwise attention relationships — as context grows, attention gets stretched thin.

| | Prompt Engineering (Old) | Context Engineering (New) |
|---|--------------------------|---------------------------|
| Focus | Writing the perfect system prompt | Curating ALL tokens at each inference step |
| Scope | Static, one-shot tasks | Multi-turn, long-horizon agentic tasks |
| Approach | Craft the right words and phrases | Optimize the entire context state |
| Context | Mostly just the prompt itself | System prompt + tools + history + data |
| Best for | Classification, summarization, Q&A | Agents running in loops over many turns |

The attention budget principle: "Find the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome."

System Prompts: Find the Right Altitude

Anthropic identifies a “Goldilocks zone” for system prompt specificity:

| Failure Mode | Problem | Solution |
|--------------|---------|----------|
| Too specific (hardcoded if-else logic) | Brittle, high maintenance | Provide heuristics, not hard rules |
| Too vague (high-level hand-waving) | Model assumes shared context that doesn’t exist | Be specific enough to guide behavior |
| Right altitude | Specific enough to guide, flexible enough for heuristics | Start minimal, add based on failure modes |

Best practices:

  • Organize into distinct sections (<background_information>, <instructions>, ## Tool guidance)
  • Use XML tagging or Markdown headers for delineation
  • Strive for the minimal set of information that fully outlines expected behavior
  • Start with the best model + minimal prompt, then add instructions based on observed failures

Context Retrieval: Just-in-Time vs. Pre-Computed

| Strategy | How It Works | Pros | Cons |
|----------|--------------|------|------|
| Pre-Computed (Classic RAG) | Embed data upfront, retrieve by similarity at query time | Fast, predictable cost | Stale indexes, irrelevant context floods attention |
| Just-in-Time (Agentic) | Maintain lightweight references; agent dynamically loads data using tools | Always fresh, only relevant context enters window | Slower, requires well-designed tools |

Anthropic’s Claude Code uses a hybrid approach: CLAUDE.md files are dropped into context upfront, while glob and grep tools allow just-in-time navigation — bypassing stale indexing issues.

Three Techniques for Long-Horizon Context

| Technique | How It Works | Best For |
|-----------|--------------|----------|
| Compaction | Summarize conversation near context limits, reinitialize with the summary | Extensive back-and-forth tasks |
| Structured Note-Taking | Agent writes persistent notes outside context (e.g., NOTES.md, to-do lists) | Iterative development with milestones |
| Sub-Agent Architectures | Specialized sub-agents handle focused tasks with clean context windows, return condensed summaries | Complex research, parallel exploration |

Compaction example from Claude Code: The model summarizes message history, preserving architectural decisions, unresolved bugs, and implementation details while discarding redundant tool outputs. It continues with compressed context plus the five most recently accessed files.
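A minimal compaction sketch, assuming a summarize helper that would be an LLM call in practice (the thresholds here are illustrative, not Claude Code's):

```python
# Compaction sketch: near the context limit, summarize older messages and
# continue with the summary plus the most recent ones.
def summarize(messages: list[str]) -> str:
    # Stub for an LLM summarization call.
    return f"[summary of {len(messages)} earlier messages]"

def compact(messages: list[str], limit: int = 10, keep_recent: int = 5) -> list[str]:
    if len(messages) <= limit:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent
```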

Structured note-taking example from Claude playing Pokémon: The agent maintains precise tallies across thousands of game steps (“for the last 1,234 steps I’ve been training my Pokémon in Route 1, Pikachu has gained 8 levels toward the target of 10”), develops maps of explored regions, and remembers combat strategies — all persisted outside the context window.


Lesson 5: Build Effective Tool Ecosystems

Anthropic’s tool design article introduces a paradigm shift: tools are a new kind of software — a contract between deterministic systems and non-deterministic agents. When a user asks “Should I bring an umbrella today?”, an agent might call the weather tool, answer from general knowledge, or ask a clarifying question. This non-determinism requires fundamentally rethinking how we write software.

Five Principles for Effective Tools

Principle 1: Choose the Right Tools to Implement

More tools ≠ better outcomes. Don’t just wrap every API. Build few, thoughtful tools for high-impact workflows:

| Instead of… | Build… |
|-------------|--------|
| list_users + list_events + create_event | schedule_event (handles availability + creation) |
| read_logs (returns everything) | search_logs (returns relevant lines + context) |
| get_customer + list_transactions + list_notes | get_customer_context (compiles all relevant info) |

Principle 2: Namespace Your Tools

Group related tools under common prefixes to delineate boundaries:

  • By service: asana_search, jira_search
  • By resource: asana_projects_search, asana_users_search

Prefix- vs suffix-based naming has non-trivial effects on evaluations — test this choice.

Principle 3: Return Meaningful Context

Prioritize contextual relevance over flexibility. Avoid low-level technical identifiers:

  • Use name, image_url, file_type rather than uuid, 256px_image_url, mime_type
  • Resolving UUIDs to natural language names significantly reduces hallucinations
  • Expose a response_format enum ("concise" vs "detailed") to let agents control verbosity

Principle 4: Optimize for Token Efficiency

  • Implement pagination, filtering, truncation with sensible defaults
  • Claude Code: 25,000 token default limit per tool response
  • Steer agents toward small, targeted searches instead of single broad ones
  • Prompt-engineer error responses to communicate specific, actionable improvements (not opaque tracebacks)
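A sketch of such a response helper, combining pagination, a character budget, and an actionable hint instead of silent truncation (the default values here are illustrative, not Anthropic's):

```python
# Token-efficiency sketch: paginated tool responses with a hard budget.
def tool_response(items: list[str], page: int = 0, page_size: int = 10,
                  char_budget: int = 2000) -> dict:
    start = page * page_size
    body = "\n".join(items[start:start + page_size])[:char_budget]
    has_more = start + page_size < len(items)
    return {
        "items": body,
        "page": page,
        "has_more": has_more,
        # Actionable guidance for the agent rather than silent truncation:
        "hint": f"pass page={page + 1} for more results" if has_more else None,
    }
```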

Principle 5: Prompt-Engineer Your Tool Descriptions

Tool descriptions are loaded into the agent’s context — they collectively steer behavior. Think of describing your tool to a new hire:

  • Make implicit context explicit (specialized query formats, niche terminology, resource relationships)
  • Use unambiguous parameter names: user_id not user
  • Impact: Claude Sonnet 3.5 achieved SWE-bench SOTA after precise tool description refinements
  • Impact: A tool-testing agent that rewrote descriptions achieved 40% decrease in task completion time

The Evaluation-Driven Tool Improvement Loop

Anthropic’s most powerful technique: use agents to improve the tools you give them.

  1. Build a prototype of your tools (use Claude Code to one-shot them)
  2. Create evaluation tasks grounded in real-world uses — strong tasks require multiple tool calls
  3. Run the evaluation with simple agentic loops (while-loops wrapping alternating LLM + tool calls)
  4. Analyze results — read transcripts, observe confusion, track tool-calling metrics
  5. Let Claude analyze transcripts and refactor tools — paste eval transcripts into Claude Code
  6. Repeat until strong performance on held-out test sets

This process yielded improvements beyond expert human-written tools in Anthropic’s Slack and Asana evaluations.


Lesson 6: Harness Long-Running Agents

For tasks spanning hours or days, agents face the “shift change” problem: each new context window starts with no memory of what came before. Anthropic developed a two-part solution.

The Initializer + Coding Agent Pattern

The Initializer Agent (session 1) creates the foundation for all future sessions:

  • A feature list (JSON) with every feature marked "failing"
  • claude-progress.txt, a log of what agents have done
  • init.sh, a script to run the development server
  • An initial git commit as the baseline for all future work

The Coding Agent (every session after) follows this sequence:

  1. Run pwd, then read the progress file and git log
  2. Read the feature list and choose the highest-priority feature
  3. Run init.sh and test basic functionality first
  4. Implement ONE feature and test it end-to-end
  5. Git commit, update the progress file, and mark the feature done

Four failure modes and their solutions:

  1. Agent tries to one-shot everything → the feature list forces incremental, one-at-a-time work
  2. Agent declares "done" prematurely → pass/fail flags plus "do not remove tests" instructions
  3. Agent leaves the environment broken → git commits and progress notes; start each session with a basic test
  4. Agent skips proper testing → explicitly prompt for browser automation (Puppeteer)

Key Insights for Long-Running Agents

  • Use JSON for feature lists — models are less likely to inappropriately modify JSON compared to Markdown
  • Git is your state management system — agents can revert bad changes and recover working states
  • Always test before implementing — start each session with a basic end-to-end test to catch broken state from the previous session
  • Explicitly require end-to-end testing — without prompting, agents tend to declare features complete after writing code without verifying them as a real user would
  • Long-horizon conversation management — agents should summarize completed work phases and store essential information in external memory before proceeding to new tasks
  • Subagent output to filesystem — subagents can write outputs directly to external systems (files, databases), then pass lightweight references back to the coordinator, preventing information loss during multi-stage processing

Lesson 7: Scale with Multi-Agent Architectures

Anthropic’s Research feature uses a multi-agent orchestrator-worker system where a lead agent coordinates while specialized subagents operate in parallel. The results: multi-agent Claude Opus 4 + Sonnet 4 subagents outperformed single-agent Opus 4 by 90.2% on their internal research eval.

Why Multi-Agent Works

In Anthropic’s analysis, three factors explained 95% of performance variance on BrowseComp:

  1. Token usage (80% of variance by itself)
  2. Number of tool calls
  3. Model choice

Multi-agent architectures scale token usage by distributing work across agents with separate context windows. Each subagent explores extensively but returns only a condensed summary.

The Token Economics

| Interaction Type | Relative Token Usage |
|------------------|----------------------|
| Chat | 1× (baseline) |
| Single Agent | ~4× |
| Multi-Agent | ~15× |

Multi-agent systems burn through tokens fast. They require high-value tasks where performance gains justify the cost. Best fit: tasks involving heavy parallelization, information exceeding single context windows, and interfacing with numerous complex tools.

The Architecture: Orchestrator-Worker

User Query
    ↓
[Lead Researcher Agent] ← Extended thinking for planning
    ↓ saves plan to Memory
    ├── [Subagent 1: "AI startups 2025"] ← interleaved thinking after each tool result
    ├── [Subagent 2: "Enterprise AI adoption"] ← parallel web search
    └── [Subagent 3: "AI regulation landscape"] ← independent context window
    ↓ condensed findings returned
[Lead Researcher Agent] ← synthesizes, decides if more research needed
    ↓ if yes → spawn more subagents
    ↓ if done →
[Citation Agent] ← processes documents, attributes sources
    ↓
Final Research Report with Citations
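The orchestrator-worker shape can be sketched as follows, with stubs in place of the lead agent and subagents (all names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Orchestrator-worker sketch: the lead agent decomposes the query into
# subtasks at runtime, subagents run in parallel with independent contexts,
# and the lead synthesizes their condensed findings.
def lead_plan(query: str) -> list[str]:
    # Stub planner; a real lead agent would use extended thinking here.
    return [f"{query}: angle {i}" for i in range(1, 4)]

def subagent(subtask: str) -> str:
    # A real subagent would search extensively, then return only a summary.
    return f"summary({subtask})"

def research(query: str) -> str:
    subtasks = lead_plan(query)
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(subagent, subtasks))
    return " | ".join(findings)        # lead-agent synthesis, stubbed
```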

Seven Principles for Multi-Agent Prompting

1. Think like your agents. Build simulations with the exact prompts and tools from your system. Watch agents work step-by-step to reveal failure modes (agents continuing when they already have results, verbose search queries, wrong tool selection).

2. Teach the orchestrator how to delegate. Each subagent needs: an objective, an output format, guidance on tools/sources, and clear task boundaries. Without detail, agents duplicate work or leave gaps. Example failure: one subagent explored the 2021 automotive chip crisis while two others duplicated work on 2025 supply chains.

3. Scale effort to query complexity. Embed scaling rules in prompts:

  • Simple fact-finding: 1 agent, 3-10 tool calls
  • Direct comparisons: 2-4 subagents, 10-15 calls each
  • Complex research: 10+ subagents with clearly divided responsibilities

4. Tool design and selection are critical. An agent searching the web for context that only exists in Slack is doomed. Give agents explicit heuristics: examine all available tools first, match tool usage to user intent, prefer specialized tools over generic ones.

5. Let agents improve themselves. Claude 4 models are excellent prompt engineers. A tool-testing agent that tested an MCP tool dozens of times and rewrote its description achieved a 40% decrease in task completion time for future agents.

6. Start wide, then narrow down. Mirror expert human research: explore the landscape before drilling into specifics. Agents default to overly long, specific queries that return few results. Prompt agents to start with short, broad queries, then progressively narrow.

7. Guide the thinking process + parallelize. Extended thinking serves as a controllable scratchpad. Interleaved thinking after tool results helps agents evaluate quality and identify gaps. Parallel tool calling transforms speed:

  • Lead agent spins up 3-5 subagents in parallel (not serially)
  • Subagents use 3+ tools in parallel
  • Result: up to 90% reduction in research time for complex queries

Production Reliability Challenges

| Challenge | Solution |
|-----------|----------|
| Agents are stateful; errors compound | Build resume-from-checkpoint systems; use retry logic + regular checkpoints; let agents know when tools fail and adapt |
| Non-deterministic debugging | Add full production tracing; monitor agent decision patterns without reading conversation content |
| Deployment disrupts running agents | Use rainbow deployments — gradually shift traffic old → new while both run simultaneously |
| Synchronous execution bottlenecks | Consider async execution for more parallelism, but manage coordination and state consistency carefully |

Lesson 8: Evaluation-Driven Development

Anthropic’s evaluation article is perhaps the most practically important for anyone building production agents. Their core message: evals are non-negotiable, and their value compounds over the entire agent lifecycle.

Why Evals Matter

“Writing evals is useful at any stage. Early on, evals force product teams to specify what success means. Later, they uphold a consistent quality bar.”

Teams without evals face weeks of testing when new models come out; teams with evals can upgrade in days. Evals also become the highest-bandwidth communication channel between product and research teams, defining metrics researchers can optimize against.

Real-world examples:

  • Descript built evals around three dimensions: don’t break things, do what I asked, do it well. They evolved from manual grading to LLM judges with periodic human calibration.
  • Bolt built an eval system in 3 months that runs their agent and grades outputs with static analysis, browser agents, and LLM judges.
  • Claude Code started with fast iteration from user feedback, then added evals — first for narrow areas like code generation, then expanding into broader behavioral evals.

The Structure of an Agent Evaluation

An evaluation consists of:

  • Task: An input + environment for the agent (e.g., “fix this auth bypass vulnerability”)
  • Graders: Logic that scores some aspect of the agent’s performance. A task can have multiple graders with multiple assertions.
  • Transcript/Trace: The full record of the agent’s actions (tool calls, reasoning, outputs)
  • Metrics: Quantitative measurements (turns taken, tokens used, latency)
  • Eval Suite: A collection of tasks designed to measure specific capabilities

Three Types of Graders

| Grader Type | Methods | Strengths | Weaknesses |
|-------------|---------|-----------|------------|
| Code-Based | String match, binary tests, static analysis, tool call verification, transcript analysis | Fast, cheap, objective, reproducible | Brittle to valid variations, lacking nuance |
| Model-Based | LLM-as-judge with rubrics, pairwise comparison, multi-dimension scoring | Flexible, handles nuance, scales well | Requires calibration, non-deterministic |
| Human | Expert review, user testing, adversarial testing | Catches what automation misses, finds edge cases | Expensive, slow, doesn’t scale |

Best practice: Use deterministic graders where possible, LLM graders where necessary, and human graders for validation. Don’t over-specify the agent’s path — grade what the agent produced, not the path it took.
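A minimal code-based grader in that spirit, with an assumed transcript shape (the field names are illustrative, not a standard):

```python
# Minimal code-based grader: deterministic checks over the agent's final
# output plus transcript-level metrics, not the exact path taken.
def grade(transcript: dict) -> dict:
    answer = transcript["final_answer"]
    checks = {
        "mentions_fix": "fixed" in answer.lower(),
        "ran_tests": any(c["tool"] == "run_tests" for c in transcript["tool_calls"]),
        "under_20_calls": len(transcript["tool_calls"]) < 20,
    }
    return {"passed": all(checks.values()), "checks": checks}
```

A single task can carry several graders like this, each with multiple assertions, alongside LLM-judged rubrics for the qualitative dimensions.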

Capability vs. Regression Evals

Capability (quality) evals ask: "What can this agent do well?" They start at a low pass rate, target tasks the agent struggles with, and give teams a hill to climb; scores should go UP over time. Once the pass rate is high, graduate those tasks to the regression suite.

Regression evals ask: "Does it still handle what it used to?" They should hold a roughly 100% pass rate and protect against backsliding; any score decline means something is broken. Run them continuously to catch drift as you hill-climb.

Evaluating Different Agent Types

Coding Agents: Rely on well-specified tasks, stable test environments, and thorough tests. Deterministic graders are natural — does the code run and do the tests pass? SWE-bench Verified and Terminal-Bench follow this approach. LLMs progressed from 40% to >80% on SWE-bench in just one year.

Conversational Agents: Require a second LLM to simulate the user. Success is multidimensional: is the ticket resolved (state check), did it finish in <10 turns (transcript constraint), was the tone appropriate (LLM rubric)?

Research Agents: Combine groundedness checks (claims supported by sources), coverage checks (key facts included), and source quality checks (authoritative vs. first-retrieved). LLM rubrics should be frequently calibrated against expert human judgment.

Computer Use Agents: Require running the agent in a real or sandboxed environment and checking whether it achieved the intended outcome. Balance DOM-based interactions (fast, token-heavy) with screenshot-based interactions (slower, token-efficient).

Handling Non-Determinism

Agent behavior varies between runs. Two metrics help:

  • pass@k: Probability of at least one success in k attempts. As k rises, score rises. Useful when one success is sufficient.
  • pass^k: Probability of ALL k trials succeeding. As k rises, score falls. Useful for customer-facing agents where consistency matters.

At k=1 they’re identical. At k=10 they tell opposite stories: pass@k approaches 100% while pass^k approaches 0%.
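Under the simplifying assumption of a fixed per-trial success probability p, the two metrics are pass@k = 1 - (1 - p)^k and pass^k = p^k, which a few lines of Python make concrete:

```python
# pass@k: at least one of k independent trials succeeds.
# pass^k: all k independent trials succeed.
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    return p ** k
```

At p = 0.5, pass@10 is above 99% while pass^10 is below 1%, which is exactly the "opposite stories" effect described above.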

The Roadmap from Zero to Great Evals

| Step | Action | Detail |
|------|--------|--------|
| 0 | Start early | 20-50 tasks from real failures is a great start. Don’t wait for hundreds. |
| 1 | Automate manual checks | Convert bug reports and manual QA into test cases. Prioritize by user impact. |
| 2 | Write unambiguous tasks | Two experts should independently reach the same pass/fail verdict. Create reference solutions. |
| 3 | Build balanced problem sets | Test both when behavior SHOULD occur and when it SHOULDN’T. Avoid class-imbalanced evals. |
| 4 | Build a robust eval harness | Each trial from a clean environment. No shared state between runs. |
| 5 | Design graders thoughtfully | Grade what was produced, not the path taken. Build in partial credit for multi-component tasks. |
| 6 | Run and iterate | Track latency, tokens, cost, error rates alongside accuracy. Calibrate LLM judges against humans. |

Critical lesson: Opus 4.5 initially scored 42% on CORE-Bench, but after fixing grading bugs, ambiguous specs, and stochastic tasks, the score jumped to 95%. A 0% pass@100 rate usually signals a broken task, not an incapable agent.


The Agent Builder’s Checklist

Phase 1: Design

  • Start with the simplest possible solution (single LLM call + retrieval)
  • Only add agent complexity when simpler solutions demonstrably fall short
  • Choose the right pattern: workflow (predefined) vs agent (model-directed)
  • Define success criteria BEFORE building — write initial eval tasks

Phase 2: Tools & Context

  • Design ACI with same care as HCI — invest heavily in tool documentation and testing
  • Build few, thoughtful tools (not 1:1 API wrappers); consolidate multi-step operations
  • Namespace tools clearly; test prefix vs suffix naming
  • Optimize tool responses for token efficiency (pagination, filtering, truncation)
  • Engineer context at every inference step — treat tokens as a finite budget
  • Use hybrid retrieval: static context files + just-in-time tool-based retrieval

Phase 3: Implementation

  • Keep agent loops simple: while loop with LLM decision → tool execution → context update
  • Implement compaction for long conversations (summarize near context limits)
  • Use structured note-taking for multi-session persistence (progress files, feature lists)
  • Use git for state management in long-running coding agents
  • Test before implementing each session — catch broken state early

Phase 4: Multi-Agent (Only If Needed)

  • Verify the task genuinely benefits from parallelization
  • Use orchestrator-worker pattern with clear delegation instructions
  • Scale effort to query complexity (embed rules in prompts)
  • Enable parallel tool calling for both lead agent and subagents
  • Use extended thinking + interleaved thinking for planning and evaluation
  • Build resume-from-checkpoint systems for error recovery

Phase 5: Evaluation

  • Start with 20-50 tasks from real usage patterns and failures
  • Combine code-based, model-based, and human graders
  • Maintain separate capability evals (hill to climb) and regression evals (catch drift)
  • Run evals in isolated, clean environments — no shared state
  • Graduate high-performing capability evals to regression suites
  • Track pass@k for tasks where one success matters; pass^k where consistency matters
  • Use agents to analyze eval transcripts and improve tools iteratively

Conclusion

The lessons from Anthropic’s six engineering articles converge on a single philosophy: simplicity is not the starting point — it’s the goal. The teams that succeed with agents are not those who build the most complex systems, but those who find the minimum viable complexity for their task and then invest ruthlessly in the details that matter — tool design, context engineering, and evaluation.

Here are the five meta-lessons that emerge from synthesizing all six articles:

  1. Complexity is a cost, not a feature. Every layer you add (multi-agent, long-horizon, async execution) introduces new failure modes. Add layers only when measurement proves they help.

  2. Tools are the new UX. The Agent-Computer Interface deserves the same design rigor as a human-facing product. Anthropic spent more time on tool design than prompt design for their best agents.

  3. Context is the bottleneck. It’s not about how much you can fit in the context window — it’s about the signal-to-noise ratio of every token. Engineer context like you engineer code: deliberately, testably, and with a bias toward deletion.

  4. Evals are the foundation. Without evals, you’re guessing. With evals, you’re engineering. Start early, start small, and let your eval suite compound in value over time.

  5. Let agents improve agents. One of the most striking patterns across Anthropic’s articles is using Claude to optimize the tools, prompts, and descriptions that Claude itself uses. This self-improving loop — build → evaluate → let the agent analyze failures → refine → repeat — is the future of agent development.

The gap between prototype and production is wider than anticipated. But with these lessons from Anthropic as your guide, you have a clear roadmap for navigating that gap.


This post is licensed under CC BY 4.0 by the author.