Anthropic just taught you how to build the best AI agent harness.
Claude Code's source is sitting in the open: 55 directories, 331 modules, the most battle-tested agent architecture in production today. I pulled apart every file. Every architectural decision. Every retry path, every compaction strategy, every permission stage.
This is not a teardown. It is a blueprint.
Here is every principle inside it, and how you use each one to build your own harness that survives production.
What Claude Code's Architecture Reveals About the Industry's Favorite Framework
You have heard the three layers: model weights, context, harness. The industry repeats them at every conference, in every tutorial, in every framework README.
1. Model Weights: the frozen intelligence, the thing you call via API.
1. Context: the prompt, the conversation history, the retrieved documents.
1. Harness: the scaffolding around the model. Tools, loops, error handling.
The framing is correct as far as it goes. The SWE-agent paper from Princeton NLP showed a 64% relative improvement on SWE-bench by changing nothing but the interface design. Same GPT-4. Same tasks. Only the environment changed. Performance gains live in layers 2 and 3, not layer 1.
Then you open Claude Code's source and realise Anthropic is not building for the model. They are building for the system.
Inside Claude Code, a four-level CLAUDE.md hierarchy lets enterprise admins enforce policies via MDM, project maintainers set conventions, and individual developers override locally. A disk-backed task list with file-based locking keeps parallel sub-agents from corrupting each other's state. Git worktree isolation gives five agents five branches on the same repo with zero conflicts. A permission pipeline cascades deny rules from enterprise to project to user to session.
None of that is harness. None of that is context. None of that is weights.
That is infrastructure: multi-tenancy, RBAC, resource isolation, state persistence, distributed coordination.
The real framework has four layers:
1. Model Weights: frozen intelligence.
1. Context: runtime input.
1. Harness: the agent's designed environment.
1. Infrastructure: multi-tenancy, RBAC, resource isolation, state persistence, distributed coordination.
Most teams talk about the first three because they are interesting to think about. The fourth is where products die. Claude Code is the first agent system I have seen that takes all four seriously, and the architecture shows it at every level.
The Core Agent Loop: An Async Generator, Not a While Loop
The heart of Claude Code lives in query.ts: 1,729 lines of TypeScript. The most important decision is in the function signature:
That function* carries more weight than it looks. An async generator yields values over time, pauses on demand, and lets any caller break out at any point.
An agent loop is not a request-response cycle. It is a long-running, streaming, cancellable process. The generator gives you all of those properties without bolting anything on.
Compare it to what most tutorials teach:
This works in a tutorial. It collapses in production for five reasons.
No streaming. The user watches a blank screen for 10-30 seconds while the model generates. Claude Code's generator yields StreamEvent objects as tokens arrive. The user sees the model working character by character. Users who can see what an agent is doing trust it more. Users who trust it give it more autonomy. Autonomy is where useful work happens.
No cancellation. Ctrl+C in the while-loop version requires a separate abort mechanism wired in from outside. With a generator, the caller stops calling .next(). The finally block runs, cleanup happens. Claude Code threads AbortSignal through every layer, and the generator makes that natural.
No composability. The REPL UI consumes the generator. Sub-agents consume it. Tests consume it. One query() function, three callers, zero duplication. Generators are a universal interface for streaming data.
No backpressure. If the model generates faster than the terminal renders, the while-loop buffers everything in memory. A generator pauses production when the consumer stops pulling. In long sessions, that difference decides whether memory stays bounded or grows until the process dies.
No error recovery inside the loop. This is where it gets serious.
Five Phases Per Iteration
Each iteration of Claude Code's agent loop runs through five phases. These phases are the reason the harness is resilient.
Phase 1: Setup. Before calling the model, the loop applies tool result budgets, runs compaction strategies if the conversation is long, and validates token counts. Most harnesses pass the raw message array to the model and hope nothing breaks.
Phase 2: Model Invocation. The loop calls queryModelWithStreaming() through a dependency-injected interface, wrapped in a retry system that handles ten error classes. The streaming tool executor starts executing tools *during this phase*, before the model finishes generating. A Grep call starts running the instant its input JSON is complete in the stream, seconds before the next tool call even begins arriving.
Phase 3: Error Recovery & Compaction. The loop checks for recoverable errors after the model responds. prompt-too-long? Compact and retry. max_output_tokens hit? Escalate from 32K to 64K and retry. Context overflow? Run reactive compaction on media-heavy messages. These are first-class states in the loop's state machine, not edge cases in outer try-catch blocks.
Phase 4: Tool Execution. Tools not yet executed by the streaming executor run here. Results yield to the UI as they complete. Haiku generates tool use summaries asynchronously so the main model does not burn tokens on bookkeeping.
Phase 5: Continuation Decision. The model's stop_reason tells the loop whether more tool calls are needed. The turn counter checks maxTurns. Hooks can request a stop. Abort signals are checked. If continuing, the loop increments state and returns to Phase 1.
Error recovery lives inside the loop, not around it. Each phase knows what can go wrong and has a specific recovery path. That is the difference between an agent that crashes on a rate limit and one that backs off, retries, falls back to a different model, and keeps working.
Dependency Injection Makes It Testable
The loop receives its dependencies through a QueryDeps interface:
Inject a mock callModel that yields predetermined events, and you can verify context overflow handling, tool failures, and cancellation without touching the real API. Most agent harnesses are untestable because they hardcode API calls into the loop. Claude Code's loop is a pure state machine with injected effects.
Tool Execution: Why Concurrency Classification Changes Everything
Claude Code ships 45+ built-in tools. The count is not the point. How they execute is.
Most agent harnesses run tools one by one: model generates tool calls, harness executes them sequentially, results go back. Safe but slow. Some harnesses run tools in parallel. Fast but dangerous: two parallel file writes to the same path corrupt it.
Claude Code classifies every tool by concurrency behavior:
The orchestration layer in toolOrchestration.ts partitions tool calls into batches. Read-only tools (Glob, Grep, Read, WebFetch) run concurrently, up to 10 in parallel. Write tools (Bash with mutations, Edit, Write) run serially. No race conditions.
Claude Code searches five files in parallel, then edits one. The speed of parallelism and the safety of serial execution, at the same time. 2-5x speedup on multi-tool turns, adding up to minutes per session.
The Streaming Tool Executor
The StreamingToolExecutor is the more interesting component. Most harnesses wait for the model to finish generating before executing any tools. Claude Code starts execution mid-stream.
For a turn with three tool calls, that hides 2-5 seconds of latency. The model generates the description of its next step while the first tool already runs. By the time the model finishes, results from early tools may already be waiting.
The hard cases are handled:
If a tool in a parallel batch fails, a per-tool siblingAbortController kills sibling processes. The parent query controller stays alive. The conversation continues. The model receives the error and recovers.
If the stream fails and falls back to non-streaming, the executor discards queued tools and generates synthetic error results for anything in flight.
Results yield in the original order even if tool 2 finishes before tool 1, keeping the narrative coherent for both the model and the user.
Tool Result Budgeting
A Bash command that dumps 1MB of logs would fill the context window with garbage if passed to the model raw. Claude Code runs a budgeting system:
1. Each tool specifies maxResultSizeChars.
1. Results exceeding the limit persist to disk.
1. The model receives a file path reference plus a preview of the first N characters.
1. applyToolResultBudget() runs before each API call to constrain total tool result tokens.
Users will run cat on enormous files. They will pipe commands that produce megabytes of output. Without budgeting, the context fills with noise and the agent loses coherence. This detail does not appear in architecture diagrams. It determines whether an agent survives real usage.
Prompt Engineering at Scale: The System Prompt Is a Caching Problem
The system prompt in Claude Code is not a string. It is a structured array of sections with caching metadata.
The SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker splits the prompt into two zones. Everything above it: identical across all users, all sessions, hitting the prompt cache at the API level globally. That is ~80% of the prompt. You are not re-tokenizing 577+ lines on every API call for every user.
Below the boundary, sections are memoized (computed once per session) or volatile (recomputed every turn). Volatile sections are minimized because each change breaks the cache for everything after it.
No agent tutorial, framework doc, or conference talk I have encountered discusses designing the prompt for cache efficiency. It is one of the highest-leverage decisions in the codebase. At scale, this determines whether your agent costs $0.02 per session or $0.20.
The CLAUDE.md Hierarchy
A four-level instruction hierarchy acts as composable memory:
Higher levels override lower ones. Enterprise admins enforce coding standards organization-wide. Users set personal preferences. Projects define conventions. Developers keep private overrides out of version control.
An @include directive enables composition:
The enterprise level integrates with MDM (Mobile Device Management) for policy enforcement. This is infrastructure engineering, not harness engineering.
Why Context Injection Lives Outside the System Prompt
User context (git status, CLAUDE.md contents, current date) is injected as the first user message, wrapped in <system-reminder> tags:
Context changes every turn. Putting it in the system prompt would invalidate the cache after the change point. Moving it to a user message keeps the system prompt cache-stable turn after turn. Small detail. Significant cost impact.
Context Window Management: Four Compaction Strategies
Most agent harnesses truncate old messages or crash when hitting context limits. Claude Code supports unlimited conversation length through four compaction strategies, ordered cheapest to most expensive.
Strategy 1: Microcompact
Runs every turn, before the API call. If a tool was called and its result has not changed since last call, the system replaces the full result with a cached reference. For tools like Read called repeatedly on the same file, this saves thousands of tokens per session. Cost: near zero.
Strategy 2: Snip Compact
Fires when approaching token limits, before expensive summarization. Removes messages from the beginning of conversation while preserving a "protected tail" of recent messages. No model call required. Lossy but fast.
Strategy 3: Auto Compact
Triggered when token usage crosses a threshold and snip is insufficient. A separate model call summarizes the prior conversation. Old messages are replaced with the summary. The system tracks compaction state to prevent loops (summarizing the summary of the summary).
Strategy 4: Context Collapse
For long-running sessions, enabled via feature flag. Multi-phase staged compression: collapse tool results first, then thinking blocks, then entire sections. The expensive option, reserved for sessions that have been running for hours.
Why the Hierarchy Matters
Cheapest strategy runs first. Most expensive fires only when nothing else works.
Most harnesses that implement compaction at all jump straight to summarization. Summarization costs tokens on both the compaction call and the summary itself. Microcompact and snip handle a large percentage of cases with zero model calls. The hierarchy means you pay for expensive compaction only when cheap compaction fails.
The "protected tail" concept matters too. When compaction runs, recent messages are never summarized away. The model keeps full fidelity on the last N exchanges even if earlier context is compressed. The model can follow through on its current plan without losing track of what it just did.
The Permission System: Seven Stages of Trust
Most agent harnesses ship a binary toggle: allow or deny. Claude Code runs a seven-stage pipeline.
Rules use glob-like pattern matching on tool name and input:
"Allow all bash" is too coarse. Nobody wants their agent running rm -rf /. "Deny all bash" makes the agent useless. "Allow git commands and npm test, prompt for everything else" is the sweet spot. Claude Code's rule engine supports exactly that.
Permission modes create progressive trust:
New users start in default, approving each action. As confidence builds, they move to acceptEdits or bypassPermissions. No binary choice between safety and speed. A spectrum.
Hooks serve as the escape hatch:
Your script receives tool call details and returns {"decision": "approve"} or {"decision": "block"}. Organizations build custom guardrails: block destructive operations, post to Slack on completion, run linters after every file write. No source modifications required.
Error Recovery: The 823-Line Retry System
services/api/withRetry.ts is 823 lines. Every line exists because of a production failure.
429 (Rate Limited): Check the Retry-After header. Under 20 seconds? Retry, keep fast mode. Over 20 seconds? Enter 30-minute cooldown. overage-disabled header present? Permanently disable fast mode, explain why.
529 (Server Overloaded): Track consecutive 529 counts. Three in a row with a fallback model available? Switch models. Background task? Bail to prevent cascade. Foreground? Retry with backoff.
400 (Context Overflow): Parse the error to extract actual and limit token counts. Recalculate: available = limit - input - 1000 safety buffer. Enforce minimum floor of 3,000 output tokens. Retry with adjusted budget.
401/403 (Auth): Clear the API key cache. Force-refresh OAuth tokens. Retry with new credentials.
Network Errors (ECONNRESET, EPIPE, timeout): Disable keep-alive socket pooling. Retry with a new connection.
The backoff formula:
delay = min(500ms × 2^attempt, 32s) + random(0, 0.25 × baseDelay)
For unattended sessions (CI/CD pipelines, background agents), persistent retry mode retries 429 and 529 errors indefinitely. Maximum 5-minute backoff. 6-hour reset cap. 30-second heartbeat emissions prevent idle kills.
The streaming layer runs its own reliability:
An idle timeout watchdog aborts the stream if no chunks arrive for 90 seconds, with a warning at 45. Stall detection logs gaps of 30+ seconds between consecutive chunks after time-to-first-byte. Streaming fallback switches to a non-streaming request if streaming fails entirely, preserving the consecutive 529 count so fallback logic does not double-count.
A three-retry wrapper around fetch is not production reliability. A state machine that understands the semantics of every error class and has a specific recovery path for each one is.
Sub-Agent Architecture: Parallelism With Isolation
Claude Code spawns sub-agents: independent instances of the agent loop, each with its own context, tools, and working directory.
Each sub-agent gets isolated context:
Aborting the parent cascades to all children. But a child cannot mutate the parent's state: appState is a no-op setter. File state caches are cloned to prevent one agent's reads from polluting another's cache.
Git Worktree Isolation
Sub-agents that modify code get their own worktree:
One agent, one worktree. Parallel agents sharing a workspace create conflicts. Worktree isolation puts each agent on its own branch; changes merge when verified. The node_modules symlink prevents disk bloat: five parallel agents do not need five copies of dependencies.
Three Spawn Backends
The multi-agent system supports three execution backends. In-process (direct Node.js, fastest, shared memory). Tmux pane (terminal multiplexer isolation, each agent visible in its own tab). Remote (CCR environment, full machine isolation).
Task coordination uses a disk-backed task list with file-based locking at ~/.claude/tasks/<taskListId>/<taskId>.json. Lock contention handled with exponential backoff (30 retries, 5-100ms). A high water mark prevents task ID reuse after reset.
The Fourth Layer: Infrastructure
Everything above describes the harness. Layer 3. Now look at what Claude Code builds around it.
Multi-Tenancy
The CLAUDE.md hierarchy is a multi-tenancy system. Enterprise policies at /etc/claude-code/CLAUDE.md apply to every developer in the org. Project policies in .claude/CLAUDE.md apply to contributors. User preferences in ~/.claude/CLAUDE.md are personal. Local overrides in CLAUDE.local.md are private.
This is RBAC for agent behavior. Enterprise admins set guardrails. Project maintainers set conventions. Developers set preferences. Each level overrides the one below. Conflicts resolve deterministically.
State Persistence Across Sessions
Compaction strategies are state persistence. Auto-compact produces a summary that becomes starting context for the next loop iteration. CLAUDE.md files preserve project-level memory across sessions. Hooks persist arbitrary state to disk. The task coordination system maintains state across multiple agent processes.
Claude Code solves session management at three levels: within a session (compaction), across sessions (CLAUDE.md), across agents (task list).
Resource Isolation
Git worktree isolation gives each sub-agent its own filesystem. The siblingAbortController contains tool failures so they do not cascade to siblings. Enterprise-level deny rules prevent agents from touching resources they should not access.
Distributed Coordination
File-based locking on the task list is distributed coordination. Prompt cache sharing between parent and child agents is distributed resource optimization. The heartbeat in persistent retry mode is a keepalive pattern. Worktree management handles concurrent agent access to a shared repository.
These are infrastructure problems with infrastructure solutions: locking, coordination, isolation, state management, access control, resource sharing.
Why This Matters for Your Build
You will hit every one of these problems. The question is whether you solve them with duct tape or design for them from the start.
The three-layer model does not prepare you. It frames the harness as the ceiling. But the harness describes how one model instance interacts with one set of tools in one session. The moment you need multiple users, multiple sessions, multiple agents, or deployment to environments you do not control, you are in layer 4.
Distributed systems engineering is becoming a core competency for agent builders. Production agent systems run on CI servers, spawn sub-processes, share state across sessions, serve users with different permissions. Teams that understand this build agents that work. Teams that stop at the harness build demos.
Extensibility: Four Mechanisms, Zero Source Modifications
Claude Code has four extension mechanisms. None require modifying source code.
Skills (Markdown Files as Commands)
Review the staged changes and create a commit message...
Markdown files with YAML frontmatter. Five sources: bundled, project, user, plugin, MCP. Path-based discovery means a skill specifying paths: ["*.tsx"] only activates when the agent touches matching files. The agent sees relevant skills, not all skills.
Hooks (Event-Driven Automation)
Six types: shell commands, LLM evaluation, agentic verification, HTTP endpoints, TypeScript callbacks, in-memory functions. Fires on PreToolUse, PostToolUse, SessionStart, FileChanged, Stop.
Hooks connect the harness to existing infrastructure. Post to Slack on task completion. Run a security scanner before every bash command. Trigger CI after every file edit. No changes to the agent loop.
MCP (Model Context Protocol)
Five transport types: stdio, SSE, HTTP streaming, WebSocket, in-process. Configured at three levels: enterprise managed, project, user. MCP gives the agent access to external systems (databases, APIs, internal tools) through a standardized protocol.
Plugins
Directories containing skills, agents, hooks, and configuration. The top-level composition mechanism: add capabilities without touching existing files.
All four mechanisms follow the same principle: composition over modification. Extend by adding, not changing. Core updates do not break extensions. Extensions do not interfere with each other.
The UI Is a Trust Mechanism
Claude Code's terminal UI runs on a custom fork of Ink (React for terminals): 251KB of rendering engine. That sounds heavy for a CLI. It is not.
Real-time streaming text renders character by character. Animated spinners interpolate from normal to error-red based on stall duration. Diff rendering shows syntax highlighting, three lines of context, word-level change markers. Multi-agent status trees display the hierarchy of active agents. Six themes include color-blind friendly options. A status line shows model name, cost in USD, context window usage percentage, rate limit utilization.
Users who can see what an agent is doing, which tool it calls, what results look like, how much context is consumed, how much it costs, give that agent more autonomy. More autonomy means more useful work done. The UI is a force multiplier.
The context window usage bar gives users an intuitive sense of remaining capacity without requiring them to understand tokenization. When it fills up, they know to wrap up or let the agent compact.
What You Take From This
You do not need to rebuild Claude Code. You need the engineering decisions behind it.
Async generators for the agent loop. Streaming, cancellation, composability, backpressure: all inherent to the abstraction. A while loop that returns a complete result leaves all four on the table.
Concurrency classification for tools. Read-only tools run in parallel. State-mutating tools run serially. 2-5x speedup, zero race conditions. Mark each tool at definition time and let the orchestration layer handle batching.
Tool execution during streaming. Parse tool calls incrementally. Start execution the instant input JSON is complete. Free latency savings on every multi-tool turn.
System prompt designed for the cache boundary. Static content first. Dynamic content last. Boundary marked explicitly. Highest-leverage cost optimization in production.
A compaction hierarchy, not a single strategy. Cheap first (microcompact, snip). Expensive last (summarization, collapse). Pay for expensive compaction only when cheap compaction fails.
Error recovery as a first-class state in the loop. Each error type (rate limit, context overflow, auth failure, network error) gets its own recovery strategy inside the state machine. Not an outer try-catch.
Layer 4 from day one. Where does state live across sessions? How do permissions scale to teams? How does coordination work when you add parallelism? Retrofitting infrastructure is harder than designing for it by an order of magnitude.
Extension points that require no code changes. Markdown files, shell scripts, protocol-based tools cover 95% of extension needs. If users fork your code to customize behavior, your architecture has a gap.
The Model Is Commodity. The Environment Determines Outcomes.
Princeton NLP proved it with SWE-agent: same model, better environment, 64% improvement. Anthropic proves it daily with Claude Code: a 55-directory, 331-module TypeScript application that turns the same Claude model powering a chat interface into a coding agent that runs unattended for hours, recovers from API outages, manages its own context across thousand-turn sessions, and coordinates multiple parallel sub-agents on the same codebase.
Four layers. Most of the industry optimizes layer 1: bigger models, higher benchmark scores. The teams winning are investing in layers 3 and 4. Better environments. Better error recovery. Better permission systems. Better context management. Better coordination.
The source code is right there. The patterns are documented. The decisions are legible.
Build the harness. Then build the infrastructure around it.





