What We Can Learn from Anthropic to Build Our Own Agent

A deep-dive on Anthropic engineering blog posts, distilling every actionable lesson into a practical guide for building production-grade AI agents.

Table of Contents

  1. Introduction
  2. The Six Pillars
  3. Lesson 1: Start Simple
  4. Lesson 2: Master the Five Workflow Patterns
  5. Lesson 3: Design Your Agent-Computer Interface
  6. Lesson 4: Context Engineering
  7. Lesson 5: Build Effective Tools
  8. Lesson 6: Harness Long-Running Agents
  9. Lesson 7: Scale with Multi-Agent Architectures
  10. Lesson 8: Evaluation-Driven Development
  11. The Agent Builder’s Checklist
  12. Conclusion

Introduction

Anthropic has published a series of deeply practical engineering blog posts that, taken together, form arguably the most comprehensive public guide to building AI agents today. These aren’t theoretical musings — they are battle-tested lessons from the teams that built Claude Code and the Claude Research feature, and that have worked with dozens of enterprise customers deploying agents in production.

This post synthesizes all six articles into a single, actionable guide to building your own agent.


The Six Pillars

| # | Article | Core Theme | Key Takeaway |
|---|---------|------------|--------------|
| 1 | Building Effective AI Agents | Architecture patterns | Start simple; use composable patterns instead of complex frameworks |
| 2 | Effective Context Engineering | Context management | Context is a finite resource — engineer it like a precious budget |
| 3 | Writing Effective Tools | Tool design | Tools are contracts between deterministic systems and non-deterministic agents |
| 4 | Effective Harnesses for Long-Running Agents | Multi-session persistence | Use initializer + coding agent pattern with structured state handoff |
| 5 | How We Built Our Multi-Agent Research System | Multi-agent coordination | Multi-agent systems excel at parallelizable, high-value tasks |
| 6 | Demystifying Evals for AI Agents | Evaluation methodology | Eval-driven development is non-negotiable for production agents |
(Figure: The Six Pillars of Anthropic's Agent Engineering Knowledge, pairing each article with its theme: patterns and architecture, the attention budget, tool design and optimization, multi-session persistence, orchestrator-worker at scale, and measurement and quality.)

Lesson 1: Start Simple — The Complexity Spectrum

Anthropic’s single most repeated piece of advice across all six articles:

“Find the simplest solution possible, and only increase complexity when needed.”

This isn’t just philosophy — it’s a practical heuristic. The most successful agent implementations they’ve seen don’t use complex frameworks. They use simple, composable patterns built directly on LLM APIs.

The Complexity Decision Framework

  1. Can a single optimized LLM call with retrieval and in-context examples solve this? If yes, stop here.
  2. Is the task well-defined with predictable subtasks? Use a workflow (predefined code paths).
  3. Does the task require flexibility and model-driven decisions? Use an agent (LLM controls the flow).
(Figure: The Complexity Spectrum, running from single LLM calls with retrieval and examples, through workflows with predefined code paths, to single agents that direct their own process and multi-agent systems with separate context windows. As complexity increases, so do latency, cost, error compounding, and debugging difficulty, but also flexibility, task scope, autonomy, and parallel capacity. Anthropic's rule: add complexity ONLY when it demonstrably improves outcomes.)

Agents vs Workflows — A Critical Distinction

  • Workflows: LLMs and tools orchestrated through predefined code paths. The developer controls the flow.
  • Agents: LLMs dynamically direct their own processes and tool usage. The model controls the flow.

Both are “agentic systems,” but knowing which you need prevents over-engineering.


Lesson 2: Master the Five Workflow Patterns

Anthropic identifies five foundational workflow patterns that cover the vast majority of production use cases. These are building blocks you can combine and customize.

1. Prompt Chaining

Sequential steps where each LLM call processes the output of the previous one. Add programmatic “gates” between steps for quality checks.

Input → [LLM Step 1] → Gate (validate) → [LLM Step 2] → Gate → [LLM Step 3] → Output

When to use: Tasks decomposable into fixed sequential steps. Example: Generate marketing copy → check tone → translate to another language.
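As a sketch of this pattern, with call_llm as a hypothetical stand-in for a real LLM client, the chain-with-gates flow might look like:

```python
# Prompt-chaining sketch with a programmatic gate between steps.
# call_llm is a hypothetical stand-in for a real LLM API call.
def call_llm(prompt: str) -> str:
    return f"[output for: {prompt}]"  # replace with a real client call

def gate_nonempty(text: str) -> str:
    # Gates validate intermediate output before the next step runs.
    if not text.strip():
        raise ValueError("empty intermediate output; aborting chain")
    return text

def chain(user_input: str, step_prompts: list[str]) -> str:
    result = user_input
    for step in step_prompts:
        result = gate_nonempty(call_llm(f"{step}\n\n{result}"))
    return result
```

Each gate is a cheap deterministic check; swapping gate_nonempty for a tone or schema validator gives the generate → check tone → translate chain from the example above.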

2. Routing

Classify the input, then direct it to a specialized handler. This lets you route easy queries to fast/cheap models and hard queries to powerful ones.

Input → [Classifier LLM] → Route A: Handler for refunds
                          → Route B: Handler for technical issues
                          → Route C: Handler for general inquiries

When to use: Distinct input categories needing different handling. Example: Customer support triage, model selection based on complexity.
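A minimal routing sketch, with a keyword classifier standing in for the classifier LLM (all handler names are hypothetical; in production, routes could also map to different models):

```python
# Routing sketch: classify, then dispatch to a specialized handler.
def classify(query: str) -> str:
    # Stub classifier; in production this would be an LLM call.
    q = query.lower()
    if "refund" in q:
        return "refunds"
    if "error" in q or "crash" in q:
        return "technical"
    return "general"

HANDLERS = {
    "refunds": lambda q: f"[refunds handler] {q}",
    "technical": lambda q: f"[technical handler] {q}",
    "general": lambda q: f"[general handler] {q}",
}

def route(query: str) -> str:
    return HANDLERS[classify(query)](query)
```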

3. Parallelization

Two variations:

  • Sectioning: Break task into independent subtasks, run in parallel, merge results.
  • Voting: Run same task multiple times for diverse perspectives, aggregate.

When to use: Independent subtasks or need for multiple perspectives. Example: Code review where one LLM writes code and another screens it for vulnerabilities. Content moderation with multiple guardrail checks.
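The sectioning variant can be sketched with standard-library concurrency, using a stub in place of the per-section LLM call:

```python
# Sectioning sketch: independent subtasks run in parallel, results merged
# in order. review_section is a stub for a per-section LLM call.
from concurrent.futures import ThreadPoolExecutor

def review_section(section: str) -> str:
    return f"reviewed: {section}"

def parallel_review(sections: list[str]) -> list[str]:
    with ThreadPoolExecutor() as pool:
        # map preserves input order even though calls run concurrently
        return list(pool.map(review_section, sections))
```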

4. Orchestrator-Workers

A central LLM dynamically breaks down tasks, delegates to worker LLMs, and synthesizes results. Unlike parallelization, subtasks are NOT predefined — they’re determined at runtime by the orchestrator.

When to use: Complex tasks where subtasks aren’t predictable upfront. Example: Multi-file code changes, multi-source research tasks.

5. Evaluator-Optimizer

One LLM generates a response → another evaluates and provides feedback → the generator refines. This loop continues until quality criteria are met.

[Generator LLM] → output → [Evaluator LLM] → feedback → [Generator LLM] → improved output → ...

When to use: Tasks with clear evaluation criteria where iterative refinement adds measurable value. Example: Literary translation, complex search refinement, code optimization.
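The loop can be sketched like this, with stub generator and evaluator functions standing in for the two LLMs, and a bounded round count so it always terminates:

```python
# Evaluator-optimizer sketch: stubs stand in for the two LLMs.
def generate(task: str, feedback: str = "") -> str:
    return task + (" (revised)" if feedback else "")

def evaluate(output: str) -> tuple[bool, str]:
    ok = "(revised)" in output            # stub acceptance criterion
    return ok, "" if ok else "needs revision"

def refine(task: str, max_rounds: int = 3) -> str:
    output = generate(task)
    for _ in range(max_rounds):
        ok, feedback = evaluate(output)
        if ok:
            break
        output = generate(task, feedback)
    return output
```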

Pattern Selection Guide

| Pattern | Best For | Example |
|---------|----------|---------|
| Prompt Chaining | Fixed sequential steps | Generate → translate → format |
| Routing | Distinct input categories | Support triage |
| Parallelization | Independent subtasks / voting | Guardrails, code review |
| Orchestrator-Workers | Unpredictable subtasks | Multi-file refactoring |
| Evaluator-Optimizer | Clear quality criteria + iterative refinement | Translation, search |

The Autonomous Agent Loop

Beyond workflows, a true autonomous agent is simply an LLM using tools based on environmental feedback in a loop:

while not done:
    action = llm.decide(context, tools)
    result = execute(action)
    context.update(result)
    if llm.should_stop(context):
        done = True

Three core principles for autonomous agents:

  1. Maintain simplicity in the agent loop design
  2. Prioritize transparency — show the agent’s planning steps to the user
  3. Carefully craft the ACI with thorough tool documentation and testing
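Fleshing out that pseudocode into a runnable sketch (stub_model is a scripted policy standing in for the LLM, so only the control flow is realistic):

```python
# Runnable sketch of the autonomous agent loop.
def stub_model(context: list[str]) -> dict:
    # Hypothetical policy: search once, then stop.
    if any(entry.startswith("result:") for entry in context):
        return {"action": "stop"}
    return {"action": "search", "input": context[0]}

TOOLS = {"search": lambda q: f"result: 3 documents about {q}"}

def run_agent(task: str, max_turns: int = 5) -> list[str]:
    context = [task]
    for _ in range(max_turns):                 # always bound agent loops
        decision = stub_model(context)
        if decision["action"] == "stop":       # the model decides when to stop
            break
        result = TOOLS[decision["action"]](decision["input"])
        context.append(result)                 # environmental feedback re-enters context
    return context
```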

Lesson 3: Design Your Agent-Computer Interface (ACI)

One of Anthropic’s most original insights is the Agent-Computer Interface (ACI) — the analog of HCI (Human-Computer Interface), but for AI agents. Anthropic reports spending more time optimizing tools than the overall prompt when building their SWE-bench agent.

ACI Design Principles

  1. Put yourself in the model’s shoes. If tool usage isn’t obvious from its description and parameters, it won’t be obvious to the model. Good tool definitions include example usage, edge cases, input format requirements, and clear boundaries.

  2. Think like writing a great docstring for a junior developer. Especially important when using many similar tools — parameter names and descriptions must disambiguate.

  3. Test empirically. Run many example inputs to see what mistakes the model makes, then iterate.

  4. Poka-yoke your tools (error-proof them). Change arguments so mistakes become harder. Example: Anthropic changed their file edit tool to require absolute file paths after seeing errors with relative paths.

  5. Choose formats the model can write easily:

    • Don’t require diffs (the model must predict chunk headers before writing code)
    • Don’t require JSON-escaped code (extra escaping of newlines and quotes)
    • Keep formats close to what appears naturally in training data

Tool Format Decision Guide

| Format Choice | Why It Matters |
|---------------|----------------|
| Absolute paths > relative paths | Eliminates state-dependent errors |
| Markdown code blocks > JSON-wrapped code | No escaping overhead |
| Full file rewrites > diffs | No need to count lines in advance |
| Provide "thinking" tokens before committal output | Prevents painting into corners |

Lesson 4: Context Engineering — The New Frontier

Anthropic argues that we’re moving beyond “prompt engineering” to context engineering — the art of curating the optimal set of tokens at every inference step.

Why Context Is a Finite Resource

LLMs have limited working memory. Anthropic identifies a phenomenon called context rot: as token count increases, the model’s ability to accurately recall information decreases. This stems from the transformer’s n² pairwise attention relationships — as context grows, attention gets stretched thin.

| | Prompt Engineering (Old) | Context Engineering (New) |
|---|--------------------------|---------------------------|
| Focus | Writing the perfect system prompt | Curating ALL tokens at each inference step |
| Scope | Static, one-shot tasks | Multi-turn, long-horizon agentic tasks |
| Approach | Craft the right words and phrases | Optimize the entire context state |
| Context | Mostly just the prompt itself | System prompt + tools + history + data |
| Best for | Classification, summarization, Q&A | Agents running in loops over many turns |

The attention budget principle: "Find the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome."

System Prompts: Find the Right Altitude

Anthropic identifies a “Goldilocks zone” for system prompt specificity:

| Failure Mode | Problem | Solution |
|--------------|---------|----------|
| Too specific (hardcoded if-else logic) | Brittle, high maintenance | Provide heuristics, not hard rules |
| Too vague (high-level hand-waving) | Model assumes shared context that doesn’t exist | Be specific enough to guide behavior |
| Right altitude | Specific enough to guide, flexible enough for heuristics | Start minimal, add based on failure modes |

Best practices:

  • Organize into distinct sections (<background_information>, <instructions>, ## Tool guidance)
  • Use XML tagging or Markdown headers for delineation
  • Strive for the minimal set of information that fully outlines expected behavior
  • Start with the best model + minimal prompt, then add instructions based on observed failures

Context Retrieval: Just-in-Time vs. Pre-Computed

| Strategy | How It Works | Pros | Cons |
|----------|--------------|------|------|
| Pre-Computed (Classic RAG) | Embed data upfront, retrieve by similarity at query time | Fast, predictable cost | Stale indexes, irrelevant context floods attention |
| Just-in-Time (Agentic) | Maintain lightweight references; agent dynamically loads data using tools | Always fresh, only relevant context enters window | Slower, requires well-designed tools |

Anthropic’s Claude Code uses a hybrid approach: CLAUDE.md files are dropped into context upfront, while glob and grep tools allow just-in-time navigation — bypassing stale indexing issues.

Three Techniques for Long-Horizon Context

| Technique | How It Works | Best For |
|-----------|--------------|----------|
| Compaction | Summarize conversation near context limits, reinitialize with the summary | Extensive back-and-forth tasks |
| Structured Note-Taking | Agent writes persistent notes outside context (e.g., NOTES.md, to-do lists) | Iterative development with milestones |
| Sub-Agent Architectures | Specialized sub-agents handle focused tasks with clean context windows, return condensed summaries | Complex research, parallel exploration |

Compaction example from Claude Code: The model summarizes message history, preserving architectural decisions, unresolved bugs, and implementation details while discarding redundant tool outputs. It continues with compressed context plus the five most recently accessed files.
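A minimal compaction sketch, assuming a summarize helper that would be an LLM call in practice (the thresholds here are illustrative, not Claude Code's):

```python
# Compaction sketch: near the context limit, summarize older messages and
# continue with the summary plus the most recent ones.
def summarize(messages: list[str]) -> str:
    # Stub for an LLM summarization call.
    return f"[summary of {len(messages)} earlier messages]"

def compact(messages: list[str], limit: int = 10, keep_recent: int = 5) -> list[str]:
    if len(messages) <= limit:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent
```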

Structured note-taking example from Claude playing Pokémon: The agent maintains precise tallies across thousands of game steps (“for the last 1,234 steps I’ve been training my Pokémon in Route 1, Pikachu has gained 8 levels toward the target of 10”), develops maps of explored regions, and remembers combat strategies — all persisted outside the context window.


Lesson 5: Build Effective Tool Ecosystems

Anthropic’s tool design article introduces a paradigm shift: tools are a new kind of software — a contract between deterministic systems and non-deterministic agents. When a user asks “Should I bring an umbrella today?”, an agent might call the weather tool, answer from general knowledge, or ask a clarifying question. This non-determinism requires fundamentally rethinking how we write software.

Five Principles for Effective Tools

Principle 1: Choose the Right Tools to Implement

More tools ≠ better outcomes. Don’t just wrap every API. Build few, thoughtful tools for high-impact workflows:

| Instead of… | Build… |
|-------------|--------|
| list_users + list_events + create_event | schedule_event (handles availability + creation) |
| read_logs (returns everything) | search_logs (returns relevant lines + context) |
| get_customer + list_transactions + list_notes | get_customer_context (compiles all relevant info) |

Principle 2: Namespace Your Tools

Group related tools under common prefixes to delineate boundaries:

  • By service: asana_search, jira_search
  • By resource: asana_projects_search, asana_users_search

Prefix- vs suffix-based naming has non-trivial effects on evaluations — test this choice.

Principle 3: Return Meaningful Context

Prioritize contextual relevance over flexibility. Avoid low-level technical identifiers:

  • Use name, image_url, file_type rather than uuid, 256px_image_url, mime_type
  • Resolving UUIDs to natural language names significantly reduces hallucinations
  • Expose a response_format enum ("concise" vs "detailed") to let agents control verbosity

Principle 4: Optimize for Token Efficiency

  • Implement pagination, filtering, truncation with sensible defaults
  • Claude Code: 25,000 token default limit per tool response
  • Steer agents toward small, targeted searches instead of single broad ones
  • Prompt-engineer error responses to communicate specific, actionable improvements (not opaque tracebacks)
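A sketch of such a response helper, combining pagination, a character budget, and an actionable hint instead of silent truncation (the default values here are illustrative, not Anthropic's):

```python
# Token-efficiency sketch: paginated tool responses with a hard budget.
def tool_response(items: list[str], page: int = 0, page_size: int = 10,
                  char_budget: int = 2000) -> dict:
    start = page * page_size
    body = "\n".join(items[start:start + page_size])[:char_budget]
    has_more = start + page_size < len(items)
    return {
        "items": body,
        "page": page,
        "has_more": has_more,
        # Actionable guidance for the agent rather than silent truncation:
        "hint": f"pass page={page + 1} for more results" if has_more else None,
    }
```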

Principle 5: Prompt-Engineer Your Tool Descriptions

Tool descriptions are loaded into the agent’s context — they collectively steer behavior. Think of describing your tool to a new hire:

  • Make implicit context explicit (specialized query formats, niche terminology, resource relationships)
  • Use unambiguous parameter names: user_id not user
  • Impact: Claude Sonnet 3.5 achieved SWE-bench SOTA after precise tool description refinements
  • Impact: A tool-testing agent that rewrote descriptions achieved 40% decrease in task completion time

The Evaluation-Driven Tool Improvement Loop

Anthropic’s most powerful technique: use agents to improve the tools you give them.

  1. Build a prototype of your tools (use Claude Code to one-shot them)
  2. Create evaluation tasks grounded in real-world uses — strong tasks require multiple tool calls
  3. Run the evaluation with simple agentic loops (while-loops wrapping alternating LLM + tool calls)
  4. Analyze results — read transcripts, observe confusion, track tool-calling metrics
  5. Let Claude analyze transcripts and refactor tools — paste eval transcripts into Claude Code
  6. Repeat until strong performance on held-out test sets

This process yielded improvements beyond expert human-written tools in Anthropic’s Slack and Asana evaluations.


Lesson 6: Harness Long-Running Agents

For tasks spanning hours or days, agents face the “shift change” problem: each new context window starts with no memory of what came before. Anthropic developed a two-part solution.

The Initializer + Coding Agent Pattern

The Initializer Agent (session 1) creates the foundation for all future sessions:

  • A feature list (JSON) with every feature marked "failing"
  • claude-progress.txt, a log of what agents have done
  • init.sh, a script to run the development server
  • An initial git commit as the baseline for all future work

The Coding Agent (every session after) follows this sequence:

  1. Run pwd, then read the progress file and git log
  2. Read the feature list and choose the highest-priority feature
  3. Run init.sh and test basic functionality first
  4. Implement ONE feature and test it end-to-end
  5. Git commit, update the progress file, and mark the feature done

Four failure modes and their solutions:

  1. Agent tries to one-shot everything → the feature list forces incremental, one-at-a-time work
  2. Agent declares "done" prematurely → pass/fail flags plus "do not remove tests" instructions
  3. Agent leaves the environment broken → git commits and progress notes; start each session with a basic test
  4. Agent skips proper testing → explicitly prompt for browser automation (Puppeteer)

Key Insights for Long-Running Agents

  • Use JSON for feature lists — models are less likely to inappropriately modify JSON compared to Markdown
  • Git is your state management system — agents can revert bad changes and recover working states
  • Always test before implementing — start each session with a basic end-to-end test to catch broken state from the previous session
  • Explicitly require end-to-end testing — without prompting, agents tend to declare features complete after writing code without verifying them as a real user would
  • Long-horizon conversation management — agents should summarize completed work phases and store essential information in external memory before proceeding to new tasks
  • Subagent output to filesystem — subagents can write outputs directly to external systems (files, databases), then pass lightweight references back to the coordinator, preventing information loss during multi-stage processing

Lesson 7: Scale with Multi-Agent Architectures

Anthropic’s Research feature uses a multi-agent orchestrator-worker system where a lead agent coordinates while specialized subagents operate in parallel. The results: multi-agent Claude Opus 4 + Sonnet 4 subagents outperformed single-agent Opus 4 by 90.2% on their internal research eval.

Why Multi-Agent Works

In Anthropic’s analysis, three factors explained 95% of performance variance on BrowseComp:

  1. Token usage (80% of variance by itself)
  2. Number of tool calls
  3. Model choice

Multi-agent architectures scale token usage by distributing work across agents with separate context windows. Each subagent explores extensively but returns only a condensed summary.

The Token Economics

| Interaction Type | Relative Token Usage |
|------------------|----------------------|
| Chat | 1× (baseline) |
| Single Agent | ~4× |
| Multi-Agent | ~15× |

Multi-agent systems burn through tokens fast. They require high-value tasks where performance gains justify the cost. Best fit: tasks involving heavy parallelization, information exceeding single context windows, and interfacing with numerous complex tools.

The Architecture: Orchestrator-Worker

User Query
    ↓
[Lead Researcher Agent] ← Extended thinking for planning
    ↓ saves plan to Memory
    ├── [Subagent 1: "AI startups 2025"] ← interleaved thinking after each tool result
    ├── [Subagent 2: "Enterprise AI adoption"] ← parallel web search
    └── [Subagent 3: "AI regulation landscape"] ← independent context window
    ↓ condensed findings returned
[Lead Researcher Agent] ← synthesizes, decides if more research needed
    ↓ if yes → spawn more subagents
    ↓ if done →
[Citation Agent] ← processes documents, attributes sources
    ↓
Final Research Report with Citations
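The orchestrator-worker shape can be sketched as follows, with stubs in place of the lead agent and subagents (all names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Orchestrator-worker sketch: the lead agent decomposes the query into
# subtasks at runtime, subagents run in parallel with independent contexts,
# and the lead synthesizes their condensed findings.
def lead_plan(query: str) -> list[str]:
    # Stub planner; a real lead agent would use extended thinking here.
    return [f"{query}: angle {i}" for i in range(1, 4)]

def subagent(subtask: str) -> str:
    # A real subagent would search extensively, then return only a summary.
    return f"summary({subtask})"

def research(query: str) -> str:
    subtasks = lead_plan(query)
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(subagent, subtasks))
    return " | ".join(findings)        # lead-agent synthesis, stubbed
```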

Seven Principles for Multi-Agent Prompting

1. Think like your agents. Build simulations with the exact prompts and tools from your system. Watch agents work step-by-step to reveal failure modes (agents continuing when they already have results, verbose search queries, wrong tool selection).

2. Teach the orchestrator how to delegate. Each subagent needs: an objective, an output format, guidance on tools/sources, and clear task boundaries. Without detail, agents duplicate work or leave gaps. Example failure: one subagent explored the 2021 automotive chip crisis while two others duplicated work on 2025 supply chains.

3. Scale effort to query complexity. Embed scaling rules in prompts:

  • Simple fact-finding: 1 agent, 3-10 tool calls
  • Direct comparisons: 2-4 subagents, 10-15 calls each
  • Complex research: 10+ subagents with clearly divided responsibilities

4. Tool design and selection are critical. An agent searching the web for context that only exists in Slack is doomed. Give agents explicit heuristics: examine all available tools first, match tool usage to user intent, prefer specialized tools over generic ones.

5. Let agents improve themselves. Claude 4 models are excellent prompt engineers. A tool-testing agent that tested an MCP tool dozens of times and rewrote its description achieved a 40% decrease in task completion time for future agents.

6. Start wide, then narrow down. Mirror expert human research: explore the landscape before drilling into specifics. Agents default to overly long, specific queries that return few results. Prompt agents to start with short, broad queries, then progressively narrow.

7. Guide the thinking process + parallelize. Extended thinking serves as a controllable scratchpad. Interleaved thinking after tool results helps agents evaluate quality and identify gaps. Parallel tool calling transforms speed:

  • Lead agent spins up 3-5 subagents in parallel (not serially)
  • Subagents use 3+ tools in parallel
  • Result: up to 90% reduction in research time for complex queries

Production Reliability Challenges

| Challenge | Solution |
|-----------|----------|
| Agents are stateful; errors compound | Build resume-from-checkpoint systems; use retry logic + regular checkpoints; let agents know when tools fail and adapt |
| Non-deterministic debugging | Add full production tracing; monitor agent decision patterns without reading conversation content |
| Deployment disrupts running agents | Use rainbow deployments — gradually shift traffic old → new while both run simultaneously |
| Synchronous execution bottlenecks | Consider async execution for more parallelism, but manage coordination and state consistency carefully |

Lesson 8: Evaluation-Driven Development

Anthropic’s evaluation article is perhaps the most practically important for anyone building production agents. Their core message: evals are non-negotiable, and their value compounds over the entire agent lifecycle.

Why Evals Matter

“Writing evals is useful at any stage. Early on, evals force product teams to specify what success means. Later, they uphold a consistent quality bar.”

Teams without evals face weeks of testing when new models come out; teams with evals can upgrade in days. Evals also become the highest-bandwidth communication channel between product and research teams, defining metrics researchers can optimize against.

Real-world examples:

  • Descript built evals around three dimensions: don’t break things, do what I asked, do it well. They evolved from manual grading to LLM judges with periodic human calibration.
  • Bolt built an eval system in 3 months that runs their agent and grades outputs with static analysis, browser agents, and LLM judges.
  • Claude Code started with fast iteration from user feedback, then added evals — first for narrow areas like code generation, then expanding into broader behavioral evals.

The Structure of an Agent Evaluation

An evaluation consists of:

  • Task: An input + environment for the agent (e.g., “fix this auth bypass vulnerability”)
  • Graders: Logic that scores some aspect of the agent’s performance. A task can have multiple graders with multiple assertions.
  • Transcript/Trace: The full record of the agent’s actions (tool calls, reasoning, outputs)
  • Metrics: Quantitative measurements (turns taken, tokens used, latency)
  • Eval Suite: A collection of tasks designed to measure specific capabilities

Three Types of Graders

| Grader Type | Methods | Strengths | Weaknesses |
|-------------|---------|-----------|------------|
| Code-Based | String match, binary tests, static analysis, tool call verification, transcript analysis | Fast, cheap, objective, reproducible | Brittle to valid variations, lacking nuance |
| Model-Based | LLM-as-judge with rubrics, pairwise comparison, multi-dimension scoring | Flexible, handles nuance, scales well | Requires calibration, non-deterministic |
| Human | Expert review, user testing, adversarial testing | Catches what automation misses, finds edge cases | Expensive, slow, doesn’t scale |

Best practice: Use deterministic graders where possible, LLM graders where necessary, and human graders for validation. Don’t over-specify the agent’s path — grade what the agent produced, not the path it took.
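A minimal code-based grader in that spirit, with an assumed transcript shape (the field names are illustrative, not a standard):

```python
# Minimal code-based grader: deterministic checks over the agent's final
# output plus transcript-level metrics, not the exact path taken.
def grade(transcript: dict) -> dict:
    answer = transcript["final_answer"]
    checks = {
        "mentions_fix": "fixed" in answer.lower(),
        "ran_tests": any(c["tool"] == "run_tests" for c in transcript["tool_calls"]),
        "under_20_calls": len(transcript["tool_calls"]) < 20,
    }
    return {"passed": all(checks.values()), "checks": checks}
```

A single task can carry several graders like this, each with multiple assertions, alongside LLM-judged rubrics for the qualitative dimensions.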

Capability vs. Regression Evals

Capability (quality) evals ask: "What can this agent do well?" They start at a low pass rate, target tasks the agent struggles with, and give teams a hill to climb; scores should go UP over time. Once the pass rate is high, graduate those tasks to the regression suite.

Regression evals ask: "Does it still handle what it used to?" They should hold a roughly 100% pass rate and protect against backsliding; any score decline means something is broken. Run them continuously to catch drift as you hill-climb.

Evaluating Different Agent Types

Coding Agents: Rely on well-specified tasks, stable test environments, and thorough tests. Deterministic graders are natural — does the code run and do the tests pass? SWE-bench Verified and Terminal-Bench follow this approach. LLMs progressed from 40% to >80% on SWE-bench in just one year.

Conversational Agents: Require a second LLM to simulate the user. Success is multidimensional: is the ticket resolved (state check), did it finish in <10 turns (transcript constraint), was the tone appropriate (LLM rubric)?

Research Agents: Combine groundedness checks (claims supported by sources), coverage checks (key facts included), and source quality checks (authoritative vs. first-retrieved). LLM rubrics should be frequently calibrated against expert human judgment.

Computer Use Agents: Require running the agent in a real or sandboxed environment and checking whether it achieved the intended outcome. Balance DOM-based interactions (fast, token-heavy) with screenshot-based interactions (slower, token-efficient).

Handling Non-Determinism

Agent behavior varies between runs. Two metrics help:

  • pass@k: Probability of at least one success in k attempts. As k rises, score rises. Useful when one success is sufficient.
  • pass^k: Probability of ALL k trials succeeding. As k rises, score falls. Useful for customer-facing agents where consistency matters.

At k=1 they’re identical. At k=10 they tell opposite stories: pass@k approaches 100% while pass^k approaches 0%.
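Under the simplifying assumption of a fixed per-trial success probability p, the two metrics are pass@k = 1 - (1 - p)^k and pass^k = p^k, which a few lines of Python make concrete:

```python
# pass@k: at least one of k independent trials succeeds.
# pass^k: all k independent trials succeed.
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    return p ** k
```

At p = 0.5, pass@10 is above 99% while pass^10 is below 1%, which is exactly the "opposite stories" effect described above.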

The Roadmap from Zero to Great Evals

| Step | Action | Detail |
|------|--------|--------|
| 0 | Start early | 20-50 tasks from real failures is a great start. Don’t wait for hundreds. |
| 1 | Automate manual checks | Convert bug reports and manual QA into test cases. Prioritize by user impact. |
| 2 | Write unambiguous tasks | Two experts should independently reach the same pass/fail verdict. Create reference solutions. |
| 3 | Build balanced problem sets | Test both when behavior SHOULD occur and when it SHOULDN’T. Avoid class-imbalanced evals. |
| 4 | Build a robust eval harness | Each trial from a clean environment. No shared state between runs. |
| 5 | Design graders thoughtfully | Grade what was produced, not the path taken. Build in partial credit for multi-component tasks. |
| 6 | Run and iterate | Track latency, tokens, cost, error rates alongside accuracy. Calibrate LLM judges against humans. |

Critical lesson: Opus 4.5 initially scored 42% on CORE-Bench, but after fixing grading bugs, ambiguous specs, and stochastic tasks, the score jumped to 95%. A 0% pass@100 rate usually signals a broken task, not an incapable agent.


The Agent Builder’s Checklist

Phase 1: Design

  • Start with the simplest possible solution (single LLM call + retrieval)
  • Only add agent complexity when simpler solutions demonstrably fall short
  • Choose the right pattern: workflow (predefined) vs agent (model-directed)
  • Define success criteria BEFORE building — write initial eval tasks

Phase 2: Tools & Context

  • Design ACI with same care as HCI — invest heavily in tool documentation and testing
  • Build few, thoughtful tools (not 1:1 API wrappers); consolidate multi-step operations
  • Namespace tools clearly; test prefix vs suffix naming
  • Optimize tool responses for token efficiency (pagination, filtering, truncation)
  • Engineer context at every inference step — treat tokens as a finite budget
  • Use hybrid retrieval: static context files + just-in-time tool-based retrieval

Phase 3: Implementation

  • Keep agent loops simple: while loop with LLM decision → tool execution → context update
  • Implement compaction for long conversations (summarize near context limits)
  • Use structured note-taking for multi-session persistence (progress files, feature lists)
  • Use git for state management in long-running coding agents
  • Test before implementing each session — catch broken state early

Phase 4: Multi-Agent (Only If Needed)

  • Verify the task genuinely benefits from parallelization
  • Use orchestrator-worker pattern with clear delegation instructions
  • Scale effort to query complexity (embed rules in prompts)
  • Enable parallel tool calling for both lead agent and subagents
  • Use extended thinking + interleaved thinking for planning and evaluation
  • Build resume-from-checkpoint systems for error recovery

Phase 5: Evaluation

  • Start with 20-50 tasks from real usage patterns and failures
  • Combine code-based, model-based, and human graders
  • Maintain separate capability evals (hill to climb) and regression evals (catch drift)
  • Run evals in isolated, clean environments — no shared state
  • Graduate high-performing capability evals to regression suites
  • Track pass@k for tasks where one success matters; pass^k where consistency matters
  • Use agents to analyze eval transcripts and improve tools iteratively

Conclusion

The lessons from Anthropic’s six engineering articles converge on a single philosophy: simplicity is not the starting point — it’s the goal. The teams that succeed with agents are not those who build the most complex systems, but those who find the minimum viable complexity for their task and then invest ruthlessly in the details that matter — tool design, context engineering, and evaluation.

Here are the five meta-lessons that emerge from synthesizing all six articles:

  1. Complexity is a cost, not a feature. Every layer you add (multi-agent, long-horizon, async execution) introduces new failure modes. Add layers only when measurement proves they help.

  2. Tools are the new UX. The Agent-Computer Interface deserves the same design rigor as a human-facing product. Anthropic spent more time on tool design than prompt design for their best agents.

  3. Context is the bottleneck. It’s not about how much you can fit in the context window — it’s about the signal-to-noise ratio of every token. Engineer context like you engineer code: deliberately, testably, and with a bias toward deletion.

  4. Evals are the foundation. Without evals, you’re guessing. With evals, you’re engineering. Start early, start small, and let your eval suite compound in value over time.

  5. Let agents improve agents. One of the most striking patterns across Anthropic’s articles is using Claude to optimize the tools, prompts, and descriptions that Claude itself uses. This self-improving loop — build → evaluate → let the agent analyze failures → refine → repeat — is the future of agent development.

The gap between prototype and production is wider than anticipated. But with these lessons from Anthropic as your guide, you have a clear roadmap for navigating that gap.


This post is licensed under CC BY 4.0 by the author.