You can build the next $1B company in a weekend using AI
## And the only skill you need to learn is agentic AI & harness engineering
TLDR: if you don't want to read 7,862 words of roadmap, you can simply give this link to your agent to personalise a roadmap ➡️ https://raw.githubusercontent.com/codejunkie99/agent-roadmap-2026/main/AGENT.md
The problem is that most engineers have no clear idea what they should learn
Some pick CrewAI because the role-based demos look slick on Twitter
Some chase every new framework that ships and never finish anything real
Others jump straight into multi-agent systems without understanding context, tools, harnesses, or evals
The result is usually the same: a lot of framework tourism and very little production-ready skill
If your goal is to become an agent engineer in 2026, you don't need to learn 12 frameworks
You need to learn how to build, harness, evaluate, and ship real agent systems in production
That means learning how to:
- build agents on a real orchestration runtime like LangGraph
- work with the Claude Agent SDK as a reference harness
- engineer context properly with Write, Select, Compress, Isolate
- write tools the model picks correctly
- add memory, durability, and sandboxing for production traffic
- build evals, trajectory checks, and CI regression gates
- ship agents that survive contact with real users and real cost
This guide is a 6-phase roadmap built on what shipped in late 2025 and early 2026
The piece is 7,000+ words and pulls from primary sources only
But its real value is that every phase has a concrete project, a canonical reading list, and the exact resources you need
That way, within roughly 17 weeks of focused work, you can reach the level of an agent engineer who can own a production AI feature end to end
Researching this took more than 60 hours of reading primary engineering blogs, papers, and shipping-engineer surveys
Now let's start reading the roadmap ⬇️
What an Agent Engineer does in 2026
A lot of people hear "AI agent engineer" and imagine someone gluing together CrewAI roles and calling it shipped
In reality, most modern agent engineers do something much more practical
They build, harness, and operate agent systems on top of frontier models
That usually includes:
- designing the agent loop and tool dispatch
- engineering context with Write, Select, Compress, Isolate
- writing tools the model selects correctly
- orchestrating sub-agents with isolated context windows
- adding skills, memory, durability, and sandboxing
- wiring evals, traces, and CI gates so "better" becomes measurable
Same model, different harness, completely different result
Anthropic's own measurement: Opus 4.5 scored 78% on CORE inside Claude Code, and 42% inside Smolagents
Same model. Full stop
That gap is harness engineering, which is what this roadmap is about
The four context primitives every agent builder needs to know: Write (scratchpads, memory files), Select (retrieval at the point of use), Compress (summarization at 85–95% of the context window), Isolate (sub-agents with their own context windows)
Anthropic's multi-agent research system beat single-agent Opus 4 by 90.2% on breadth-first research using exactly this pattern, while burning ~15× the tokens
In practice there are only two stacks worth learning deeply in 2026: LangGraph 1.0 + Deep Agents, and the Claude Agent SDK
The rest are either fading out, getting absorbed, or worse versions of these two for production
Free resources to follow throughout the roadmap
These are the blogs, courses, channels, and newsletters that ship signal for free
Subscribe to them in Phase 0 so the rest of the roadmap lands on a steady drip of new posts, case studies, and primary-source updates
None of these are paywalled, and most of them update faster than any textbook ever could
Engineering blogs to subscribe to
Resources:
- Anthropic engineering blog (free, official) — If you read one blog, read this one. Context engineering, harness design, multi-agent research, advanced tool use, evals. All primary sources, all referenced repeatedly across this roadmap
- LangChain blog (free) — Where the harness, middleware, and Deep Agents discipline get formalized in public. Read everything by Lance Martin, Vivek Trivedy, and Harrison Chase
- OpenAI Cookbook (free, GitHub) — Working notebooks for every API feature. Tool use, structured outputs, evals, agents. Type along
- Hamel Husain's blog (free) — "Your AI Product Needs Evals" is the eval essay everyone links to. Everything else on the site is in the same league. If you build evals, read this twice
- Eugene Yan's blog (free) — "Patterns for Building LLM-based Systems & Products" is the practitioner write-up everyone references. Opinionated and calibrated against real shipping experience
- Lilian Weng's blog (free) — Long-form deep dives on agents, prompt engineering, hallucination, alignment. The clearest synthesis writing in the field
- Simon Willison's blog (free) — Daily notes from a senior engineer who ships. Good for sanity-checking hype and catching weird edge cases first
- Chip Huyen's blog (free) — ML systems from first principles. Her "Building LLM applications for production" piece is required reading before Phase 5
- Phil Schmid's blog (free) — Practical end-to-end guides on HuggingFace, Gemini, fine-tuning, deployment. Always shows the code
- Cameron Wolfe writes Deep (Learning) Focus (free) — Long-form paper breakdowns. Catch up on a research area in one read
Free courses worth completing
Resources:
- DeepLearning.AI Short Courses (free) — Short 1–2 hour courses, almost all free. The LangGraph course (built with LangChain) and Andrew Ng's "Agentic AI" course (Reflection, Tool Use, Planning, Multi-Agent design patterns) are the two to complete in Phase 0
- LangChain Academy: Introduction to LangGraph (free) — The official free course. State, memory, human-in-the-loop, multi-agent. Do this in Phase 2
- Anthropic Interactive Prompt Engineering Tutorial (free, GitHub) — Nine chapters as Jupyter notebooks against the Claude API. The fastest way to build prompting muscle
- HuggingFace Agents Course (free) — End-to-end coverage of agents, smolagents, MCP, and evaluation. Free certificate
- HuggingFace LLM Course (free) — Foundations: tokenization, transformers, fine-tuning. Useful background even if you only build on APIs
- MCP Fundamentals on FreeAcademy (free) — Build MCP servers, connect them to Claude, write custom tools. The fastest path to MCP literacy
YouTube channels and talks
Resources:
- Andrej Karpathy (free) — Neural Networks: Zero to Hero builds GPT from scratch in raw Python. His 2026 "Vibe Coding to Agentic Engineering" talk at Sequoia AI Ascent is the clearest take on why harness engineering matters now
- AI Engineer (free) — All AI Engineer Summit and World's Fair talks. Search for talks by Hamel Husain, swyx, Anthropic engineers, and Erik Schluntz
- LangChain (free) — Weekly tutorials on LangGraph, Deep Agents, middleware, and integrations. Often the first place new features land in video form
- Anthropic (free) — Talks from Anthropic engineers. Multi-agent research walkthroughs, Claude Code internals, Skills
- Yannic Kilcher (free) — Paper breakdowns. Saves you reading every arXiv preprint yourself
- Lex Fridman Podcast on YouTube (free) — Long-form interviews with the people building and researching AI. Karpathy, Schulman, Sutskever, Amodei
Newsletters worth subscribing to
Resources:
- Latent Space by swyx and Alessio (free) — The technical newsletter for AI engineers. AINews daily roundup, podcast, and the annual "AI Engineering Reading List". If you only subscribe to one, this is it
- The Batch by Andrew Ng (free) — Weekly broad-spectrum coverage. Good for noticing when something new is breaking out
- Import AI by Jack Clark, Anthropic co-founder (free) — Policy plus research roundup. Closest thing to a strategic context briefing for the field
- Ben's Bites (free) — Daily AI news in five minutes. Skim only. Useful for catching announcements you'd otherwise miss
- TLDR AI (free) — Daily digest, low-noise. Pair with one of the deeper newsletters above
- AI Engineer Pack by swyx (free) — Curated free credits, tools, and resources for AI engineers. Updated continuously
Open-source repos worth studying
Resources:
- Anthropic Cookbook (free, GitHub) — Reference implementations of every workflow pattern. Already on the Phase 0 list. Re-read it after each phase
- OpenAI Cookbook (free, GitHub) — Same idea, OpenAI side. Tool use, structured outputs, evals, agents
- deepagents by LangChain (free, GitHub) — The reference open-source harness on top of LangGraph. Read the middleware files when you build your own harness in Phase 3
- LangGraph examples (free, GitHub) — Runnable LangGraph patterns. Supervisor, hierarchical teams, planning, customer support agent
- inspect_evals (free, GitHub) — 200+ standard evals as a Python package. GAIA, SWE-bench, Cybench, BFCL
- awesome-agentic-engineering-resources (free, GitHub) — Community-curated index of agent engineering resources. Use to fill gaps this roadmap doesn't cover
Podcasts for the commute
Resources:
- Latent Space (free) — Long-form interviews with the people shipping the field. Anthropic, OpenAI, LangChain, Modal, E2B all on the guest list
- Dwarkesh Podcast (free) — Long interviews on AI strategy, capability, and policy. Long-form, primary sources
- The TWIML AI Podcast by Sam Charrington (free) — Weekly technical interviews with researchers and engineers
- Practical AI (free) — Engineering-focused. Less hype, more shipping
- The MAD Podcast by Matt Turck (free) — Founder plus investor lens on the data and AI ecosystem. Useful for tracking who is shipping vs raising
Communities worth joining
Resources:
- LangChain Discord (free) — Where you'll find the LangGraph and Deep Agents core team. Active #help channels
- HuggingFace Discord (free) — Largest open-weights and ML community
- r/LocalLLaMA (free) — Open-weights model news, benchmarks, and tooling. Often faster than the official channels
- AI Engineer World's Fair (free with signup) — The professional network of the field. Job postings, hiring channels, working groups
- Anthropic Discord (free) — Claude developer community. Skills sharing, hooks patterns, MCP servers
*What to focus on:* pick one blog, one newsletter, one podcast, and one community in Phase 0. Don't try to follow all 40+ resources at once
Add more only when the existing ones stop surprising you
The point of this list is breadth so you can choose, not a checklist to complete
Phase 0: Foundations (1–2 weeks)
Your goal this phase: Build correct mental models. Don't write a single line of agent code yet beyond throwaway scripts
Most beginners skip this phase, dive straight into framework tutorials, and end up with code they can't reason about when it fails. Don't skip it
What to learn
1. The augmented LLM and the workflow vs agent distinction
Before you touch a framework, you need to understand the five workflow patterns Anthropic identified (prompt chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer) and why a workflow is not the same thing as an agent
A workflow has a fixed control flow you wrote
An agent makes its own control-flow decisions inside a loop
This distinction will save you from building agents that should have been chains
Resources:
- Building Effective Agents by Anthropic (Erik Schluntz and Barry Zhang) (Dec 2024) (free, official) — The five workflow patterns plus the augmented-LLM concept. Everyone in the field cites this. Read it first
- Anthropic Cookbook (patterns/agents folder) (free, GitHub) — Reference implementations of every workflow pattern as runnable notebooks. Type along, don't just read
- Simon Willison's annotations of Building Effective Agents (free) — A senior engineer's sanity-check perspective on the same paper
*What to focus on:* the difference between a workflow and an agent, the augmented-LLM mental model, the orchestrator-worker pattern, why parallelization usually beats sequential reasoning, and the failure modes Anthropic explicitly warns about
2. Context engineering as a discipline
Prompt engineering is dead as a standalone skill in 2026. The replacement is context engineering: deciding what tokens are in front of the model at every step of the loop
Resources:
- Effective context engineering for AI agents by Anthropic (Sep 29, 2025) (free, official) — Read this one twice. Memorize the framing
- Context Engineering for Agents by Lance Martin (LangChain) (free) — The Write, Select, Compress, Isolate framework. The one mental model you need
- How we built our multi-agent research system by Anthropic (Jun 2025) (free, official) — The orchestrator-worker reference architecture, the 90.2% breadth-first research improvement, and the 15× token caveat
- Simon Willison's annotations of the multi-agent research post (free) — Sanity-check perspective on the architecture and the cost trade-offs
*What to focus on:* what each of Write, Select, Compress, and Isolate means in code, why sub-agents are an isolation primitive (not a parallelism primitive), and when you would use compaction vs offloading vs summarization
3. The harness as an operating system
A clean walkthrough of what "harness" means
Resources:
- The Complete Guide to Harness Engineering (ClaudeCodeLab) (free) — Three-level harness escalation with runnable code
- Inside the Claude Agents SDK (ML6) (free) — The CPU/RAM/OS/App analogy plus the 78% vs 42% Opus 4.5 number that motivates this whole roadmap
- Building agents with the Claude Agent SDK (Anthropic) (free, official) — Why the SDK exists, why it was renamed from Claude Code SDK
- Effective harnesses for long-running agents by Anthropic (Nov 26, 2025) (free, official) — Anthropic's own harness primer. Read alongside Vivek Trivedy's posts to triangulate the same ideas from a different team
- Harness design for long-running application development by Anthropic (Mar 24, 2026) (free, official) — The follow-up. What changes when sessions stretch to hours and days. Phase 3 essential reading too
- How to think about agent frameworks by Harrison Chase (LangChain) (free) — The orchestration framework vs abstraction distinction. Required before you pick anything
*What to focus on:* the loop, tool dispatch, context curation, persistence, hooks, sub-agent orchestration, observability. And how each of those gets implemented in any harness you'll meet
4. The 2026 state of the field
Resources:
- State of Agent Engineering (LangChain) (free) — 1,340 respondents, Nov–Dec 2025. Get the numbers in your head: 57% of teams in production, 89% have observability, 52% have evals, quality (32%) is the #1 barrier
- How to Build an Agent (LangChain) (free) — The "smart intern" framing for scoping what an agent should and should not own
- Continual learning for AI agents by Harrison Chase (LangChain) (free) — Three layers where agents actually learn: weights, prompts, memory. The framing you need before you reach for fine-tuning anything
*What to focus on:* where teams struggle in production (quality, cost, reliability), what the median stack looks like, and where the marginal hour of effort pays off
Practice project: Write a 2-page personal doc, by hand, that defines in your own words: workflow vs agent, augmented LLM, the four context-engineering primitives, the orchestrator-worker pattern, the difference between harness, model, and framework, and the top three failure modes you expect to see in your own code
This document is the actual deliverable
If you can't write it without looking, you haven't read carefully enough
Phase 0 Milestone
By the end of this phase you should be able to:
- Explain what an agent is and how it differs from a workflow without using framework jargon
- Name the four context-engineering primitives and give a code-level example of each
- Explain why the harness contributes more than the model in 2026
- Describe the orchestrator-worker pattern and the 15× token cost trade-off
- Pick a framework on architectural grounds, not vibes
Phase 1: Build your first simple agent (2–3 weeks)
Your goal this phase: Write a tool-using agent twice. Once with Anthropic's raw SDK, once with the Claude Agent SDK harness. Feel the difference between rolling your own loop and standing on a real harness
This is the cheapest possible way to understand what a harness gives you
What to learn
1. The agent loop from scratch
The loop is not magic. You call the model with messages and tools, you parse out tool_use blocks, you execute the tools, you append tool_result, you loop until stop_reason equals end_turn
Once you've written this in ~100 lines yourself, every framework becomes readable
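To make the shape of that loop concrete, here is a minimal stdlib-only sketch. The model is stubbed out with a scripted function so the example runs anywhere; in the real version you would replace call_model with anthropic.messages.create, and the tool names and responses here are illustrative only:

```python
# Minimal sketch of the agent loop, with the model stubbed out.
# A real version swaps call_model for anthropic.messages.create;
# the scripted responses and tool names below are illustrative.

def call_model(messages, tools):
    """Stub: pretend the model first requests a tool, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"stop_reason": "tool_use",
                "tool_calls": [{"id": "t1", "name": "read_file",
                                "input": {"path": "notes.md"}}]}
    return {"stop_reason": "end_turn", "text": "Summary: notes say hello."}

TOOLS = {"read_file": lambda path: f"(contents of {path}: hello)"}

def run_agent(user_prompt, max_turns=10):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        resp = call_model(messages, TOOLS)
        if resp["stop_reason"] == "end_turn":
            return resp["text"]
        for call in resp["tool_calls"]:  # execute every requested tool
            result = TOOLS[call["name"]](**call["input"])
            messages.append({"role": "tool", "tool_use_id": call["id"],
                             "content": result})
    raise RuntimeError("agent did not terminate")

print(run_agent("Summarize my notes"))
```

The whole trick is visible in ~25 lines: parse tool calls, execute, append results, loop until end_turn. Everything a harness adds sits on top of this skeleton.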
Resources:
- Tutorial: Build a tool-using agent (Anthropic docs) (free, official) — The reference for tool_use, tool_result, parallel tool calls, and the response loop
- Writing tools for agents (Anthropic) (free, official) — Read this before you design any tool. The descriptions for your tools and their parameters are the user manual for the LLM
- Equipping agents for the real world with Agent Skills (Anthropic) (free, official) — The progressive-disclosure pattern explained by the team that wrote the spec
*What to focus on:* how the request/response loop terminates, what stop_reason values mean, how parallel tool calls are encoded, error recovery when a tool throws, and how to design a tool description so the model picks it correctly
Practice: Build a "from scratch" agent in 100 lines using anthropic.messages.create with a tool spec. Three tools: web_search via Tavily or Firecrawl, read_file, write_file. No framework. Run it on a research task and read every step of the trace
2. The Claude Agent SDK as the canonical harness
The Claude Agent SDK is the same harness that powers Claude Code
You will study it as a reference and use it as your day-1 tool
Resources:
- Claude Agent SDK docs (free, official) — The Python and TypeScript SDKs, hooks, sub-agents, skills, and the Task tool
- Claude Agent SDK, Skills reference (free, official) — How SKILL.md files work, the metadata frontmatter, progressive loading
- claude-code-best-practices by Muhammad Usman GM (free, GitHub) — Skim, don't copy wholesale. Useful for seeing what real users do
- claude-code-best-practice by Shan Raisshan (free, GitHub) — Companion compendium with a different curation slant
- Evaluating Skills (LangChain) (free) — How LangChain measures whether a Skill is actually pulling its weight. Useful once you've written your first Skill in this phase and want to know if it's helping or hurting
*What to focus on:* the CLAUDE.md system-prompt pattern, how Skills are loaded progressively, the PreToolUse and PostToolUse hooks, spawning sub-agents via the Task tool, and how the SDK handles permission prompts
Practice: Rebuild the same agent from the previous topic using claude-agent-sdk. Add a CLAUDE.md with project conventions. Add one Skill (folder with SKILL.md) that defines a "research-summary" output format. Add one PostToolUse hook that auto-formats any file the agent writes. Spawn one sub-agent for a sub-task using the Task tool
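For the Skill, the folder layout is just a directory containing a SKILL.md. The two frontmatter fields shown below (name, description) are the ones Anthropic's Skills reference describes; the body content is an illustrative example of the "research-summary" format, not an official template:

```markdown
---
name: research-summary
description: Use when the user asks for a research summary. Defines the required output format.
---

# Research summary format

Every research summary must contain:
1. A one-paragraph answer up front
2. A "Key findings" bullet list, each bullet with an inline citation
3. A "Sources" section listing every URL consulted
```

The description is what gets loaded into context up front; the body only loads when the model decides the Skill is relevant. That's progressive disclosure in practice.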
3. Ship something tiny
Tutorials don't count. You need a thing that runs on a schedule and that you read the output of
Practice project: A daily-briefing agent that reads your local Markdown notes and a couple of RSS feeds, produces a summarized briefing with citations, and writes it to disk. Cron it via launchd or systemd. Run it for a week. Watch it fail. Fix it
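For the scheduling piece, a systemd user timer is the simplest option on Linux. The unit names and paths below are placeholders for your own setup:

```ini
# ~/.config/systemd/user/briefing.service  (placeholder paths)
[Unit]
Description=Daily briefing agent

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 %h/agents/briefing.py

# ~/.config/systemd/user/briefing.timer
[Unit]
Description=Run the briefing agent every morning

[Timer]
OnCalendar=*-*-* 07:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with `systemctl --user enable --now briefing.timer`. Persistent=true means a missed run fires on next boot, which matters for a laptop.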
Phase 1 Milestone
By the end of this phase you should be able to:
- Write a tool-using agent loop in under 100 lines without a framework
- Explain what stop_reason values mean and how parallel tool calls work
- Build the same agent on the Claude Agent SDK with a Skill, a hook, and a sub-agent
- Articulate, in 200 words, what the harness gave you for free that you wrote yourself in the from-scratch version
Phase 2: Build a real agent with proper architecture (3–4 weeks)
Your goal this phase: Build a multi-step, persistent, stateful agent on LangGraph 1.0 + LangChain create_agent + Deep Agents
This is the stack you'll likely run in production. The conceptual model (state machine of nodes and edges, middleware, checkpointer) generalizes everywhere
Why this stack and not Pydantic AI, OpenAI Agents SDK, or CrewAI:
- LangGraph is the only framework in the Alice Labs and Channel.tel "what ships" rankings that combines durable execution, checkpointing, human-in-the-loop, first-class observability via LangSmith, and middleware
- create_agent (LangChain 1.0, Oct 2025) is now the default agent factory built on the LangGraph runtime; create_react_agent is deprecated
- Deep Agents (LangChain, launched Aug 2025; v0.5 alpha April 2026) is a batteries-included harness on top. Planning, virtual filesystem, sub-agents, summarization, skills. It's the closest open-source analog to Claude Code's harness, but model-agnostic
What to learn
1. The LangGraph runtime
A state graph of nodes and edges, with a checkpointer that lets you resume, rewind, and fork
Resources:
- LangGraph docs (free, official) — The runtime reference. Start with the concepts page, then the quickstart
- Doubling down on Deep Agents (LangChain) (free) — Defines harness vs framework vs runtime cleanly
- Context Management for Deep Agents (LangChain) (free) — The 20K-token tool-response offload pattern and the 85% context-window compression triggers
- On Agent Frameworks and Agent Observability (LangChain) (free) — Why LangSmith is OTEL-friendly and works without LangChain. Useful even if you choose another platform later
- Deep Agents v0.5 (LangChain) (free) — The April 2026 release notes. Async (non-blocking) sub-agents, expanded multi-modal filesystem support, async TODOs. Read this before you pin a deepagents version in your project
*What to focus on:* state schemas, nodes, edges, conditional edges, the PostgresSaver checkpointer, time-travel debugging, human-in-the-loop interrupts, and how middleware composes
2. Middleware as the customization layer
Middleware is how you customize a packaged agent without forking it
Resources:
- How Middleware Lets You Customize Your Agent Harness (LangChain) (Mar 26, 2026) (free) — The before_agent, wrap_model_call, before_tools, after_tools hooks. Required reading
- Introducing ambient agents (LangChain) (free) — Background-agent UX patterns: notify, question, review
*What to focus on:* where each hook fires in the agent lifecycle, how SummarizationMiddleware and FilesystemMiddleware compose, how to write a custom middleware in 30 lines, and when middleware is the right answer vs writing a new node
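The middleware idea itself is framework-independent. LangChain's actual API differs in its details, so the sketch below is plain Python showing only the shape: each middleware wraps the model call and can edit the request on the way in or observe the response on the way out:

```python
# Conceptual sketch of middleware in plain Python (NOT the real
# LangChain API): each layer wraps the model call.

def base_model_call(request):
    # Stand-in for the actual model invocation
    return {"text": f"echo: {request['prompt']}", "tokens": len(request["prompt"])}

def logging_middleware(next_call):
    def wrapped(request):
        response = next_call(request)
        print(f"model call used ~{response['tokens']} tokens")  # observe on the way out
        return response
    return wrapped

def truncation_middleware(next_call, max_chars=100):
    def wrapped(request):
        # Edit the request on the way in: trim oversized context
        request = {**request, "prompt": request["prompt"][:max_chars]}
        return next_call(request)
    return wrapped

# Compose: truncation runs first, then logging wraps the inner call
call = truncation_middleware(logging_middleware(base_model_call))
print(call({"prompt": "x" * 500})["tokens"])
```

Once you see middleware as nested function wrappers around one model call, the before/after hook names in any framework become obvious.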
3. Tools, MCP, and the code-execution pattern
The naive "load all MCP tools into context" pattern is broken. The correct pattern is code execution with MCP
Resources:
- Code execution with MCP (Anthropic) (Nov 2025) (free, official) — The 150K → 2K token reduction. Read this before you wire any MCP server
- Introducing advanced tool use (Anthropic) (free, official) — defer_loading: true cut tool tokens 85% and lifted Opus 4.5 MCP eval from 79.5% to 88.1%
- Scaling Managed Agents (Anthropic) (free, official) — The session, harness, and sandbox separation. Read it even if you don't use Managed Agents
- Composio docs (free tier) — 200+ SaaS integrations, MCP gateway built in, brokers credentials so they never enter the model context
- Arcade docs (free tier) — Use when you need fine-grained per-user identity rather than service-level auth
*What to focus on:* defer_loading, code execution as a tool surface, why round-tripping JSON through the model is expensive, and how Composio or Arcade brokers SaaS auth without leaking credentials into the model context
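The cost argument is easy to see with arithmetic. In the sketch below, the record counts and the chars/4 token approximation are assumptions for illustration; the point is that the naive pattern pushes the whole payload through context, while the code-execution pattern only returns the tiny computed answer:

```python
import json

# Why round-tripping JSON through the model is expensive, in numbers.
records = [{"id": i, "status": "open" if i % 10 == 0 else "closed"}
           for i in range(5_000)]
payload = json.dumps(records)

# Naive MCP pattern: the entire tool payload enters the model's context
naive_tokens = len(payload) // 4        # rough chars/4 token approximation

# Code-execution pattern: a snippet filters the data in the sandbox,
# and only the small result re-enters context (+~50 tokens of code)
code_result = str(sum(r["status"] == "open" for r in records))
code_tokens = len(code_result) // 4 + 50

print(naive_tokens, code_tokens)
```

The exact numbers don't matter; the orders of magnitude do, and they compound on every loop iteration.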
4. Memory choices that aren't a vector DB
Resources:
- Letta MemFS benchmark on LoCoMo (free) — The April 2026 result: filesystem-based memory on GPT-4o-mini hit 74% on LoCoMo, beating bespoke memory tools
- Mem0 docs (free) — User-scoped knowledge memory. Pick this for cross-session user facts
*What to focus on:* the three memory layers (thread-scoped via PostgresSaver, user-scoped via Mem0/Zep, self-managed via Letta), why filesystem is the right default, and not reaching for a vector DB until you've measured an actual recall problem
Practice project: Build a "research analyst" deep agent
Input: a research question
The lead agent plans, writes a TODO list to a virtual filesystem, and spawns 3 search sub-agents in parallel, each with isolated context
Sub-agents call Tavily or Firecrawl, write results to files, and return short summaries to the parent. Never raw search results into the parent's context
A citation sub-agent verifies claims against retrieved sources
A writer agent produces a final Markdown report with inline citations
All state persists via PostgresSaver. Kill the process mid-run, resume from where it left off
Human-in-the-loop interrupt: agent must ask for confirmation before exceeding $1 in tokens
Wrap the whole thing in a single make demo target that runs the full pipeline end to end
README must articulate: which middleware you used and why, which sub-agents have isolated context, what your context-compression strategy is, and what your durability story is on process kill
Ship a LangSmith trace URL for one full run alongside the README
Phase 2 Milestone
By the end of this phase you should be able to:
- Build a multi-step LangGraph agent with PostgresSaver durability and human-in-the-loop interrupts
- Use Deep Agents middleware (planning, filesystem, sub-agents, summarization) as a packaged harness
- Spawn isolated-context sub-agents and return compressed summaries to the parent
- Articulate your context-compression strategy and your durability story on process kill
- Produce a LangSmith trace URL showing the full multi-step trajectory
Phase 3: Build the harness layer yourself (3–4 weeks)
Your goal this phase: Stop using a packaged harness and build a thin one. You'll never make the right harness trade-offs in production until you've built one once
This is the highest-leverage phase in the roadmap
What to learn
1. What "harness" decomposes into
Synthesizing the Deep Agents middleware list, the Claude Agent SDK architecture, and Vivek Trivedy's harness-engineering write-up, the harness is the union of:
- loop control. The while-loop driving model→tools→model
- tool dispatch. Registry, schema validation, parallel calls, error recovery, retries
- context management. System-prompt assembly, message-history compaction at 85–95% of window, tool-response offloading at ~20K tokens, prompt caching
- persistence. Checkpoint state every node so you can resume, rewind, fork
- sub-agent orchestration. Spawn isolated-context children, route compressed summaries back
- skills and progressive disclosure. Load capabilities only when relevant
- hooks. PreToolUse, PostToolUse, PreCompact, Stop, SessionStart (the Claude Code list is canonical)
- observability. OTEL spans for every model call, tool call, sub-agent invocation, with token counts and latency
- sandboxing. Code execution and MCP tool calls happen in a container the model never has direct creds to
- auth and secrets brokering. Credentials never enter the model's context (Anthropic Managed Agents pattern)
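Two of those components, context offloading and compaction triggers, fit in a few lines each. The 20K-token offload threshold and 85% compaction trigger come from the posts cited in Phase 2; the chars/4 token counter is a rough stand-in for a real tokenizer:

```python
import os, tempfile, uuid

WINDOW_TOKENS = 200_000
OFFLOAD_TOKENS = 20_000   # offload threshold from the Deep Agents pattern
COMPACT_AT = 0.85         # compact at 85% of the context window

def count_tokens(text):
    return len(text) // 4  # rough approximation, not a real tokenizer

def maybe_offload(tool_result, workspace):
    """Replace oversized tool results with a file path plus a short preview."""
    if count_tokens(tool_result) <= OFFLOAD_TOKENS:
        return tool_result
    path = os.path.join(workspace, f"{uuid.uuid4().hex}.txt")
    with open(path, "w") as f:
        f.write(tool_result)
    preview = "\n".join(tool_result.splitlines()[:10])
    return f"[offloaded to {path}]\n{preview}"

def needs_compaction(messages):
    used = sum(count_tokens(m["content"]) for m in messages)
    return used >= COMPACT_AT * WINDOW_TOKENS

workspace = tempfile.mkdtemp()
small = maybe_offload("short result", workspace)           # passes through
big = maybe_offload("line\n" * 200_000, workspace)         # ~250K tokens, offloaded
print(small)
print(big.splitlines()[0])
```

Loop control and tool dispatch are similar in spirit: each component is small on its own, and the harness is the discipline of composing them.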
Resources:
- The Anatomy of an Agent Harness (LangChain) (free) — The cleanest decomposition of harness components in the public literature. Reference text for the entire phase. Read this before you write a single line of harness code
- Improving Deep Agents with harness engineering by Vivek Trivedy (LangChain) (Feb 17, 2026) (free) — Went from rank 30 to rank 5 on Terminal-Bench 2.0 only by changing the harness, holding the model fixed at GPT-5.2-codex. The recipe is in the post
- Better Harness: A Recipe for Harness Hill-Climbing with Evals by Vivek Trivedy (LangChain) (Apr 29, 2026) (free) — The direct sequel. Self-verification and tracing as the recipe for autonomously improving a harness. Read this immediately after the Feb 17 post
- Inside the Claude Agents SDK (ML6) (free) — The CPU/RAM/OS/App analogy and the 78% vs 42% harness-comparison number
- everything-claude-code (Cerebral Valley × Anthropic hackathon winner) (free, GitHub) — For inspiration on where to stop adding features
- deepagents source (free, GitHub) — Read this alongside your own harness as a reference. The middleware files are the core of the harness pattern
*What to focus on:* which harness components are worth writing yourself, which to import, and the order in which features pay off (loop and tool dispatch before sub-agents before durability before observability)
2. Durable execution as an add-on
Resources:
- Inngest docs (free) — Durable steps and checkpointing went GA in Dec 2025. The easiest path to durability for a Python harness
- Temporal Python SDK (free) — The OpenAI Agents SDK and Temporal integration shipped in March 2026. Treat each tool call as a durable step
*What to focus on:* idempotency keys per step, retry policies, what happens to in-flight tool calls on process kill, and where your harness's checkpoint boundary should be (per node, not per token)
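The core mechanic behind both Inngest and Temporal is simple enough to sketch. This toy version (their real APIs differ) checkpoints each step's result under an idempotency key, so re-running after a crash skips already-completed side effects:

```python
import json, os, tempfile

class DurableRun:
    """Toy durable-execution wrapper: results are checkpointed per
    idempotency key, so a retry after a crash skips completed steps.
    Inngest and Temporal do this (and much more) for you."""

    def __init__(self, path):
        self.path = path
        self.done = {}
        if os.path.exists(path):
            self.done = json.load(open(path))

    def step(self, key, fn):
        if key in self.done:            # already ran before the crash
            return self.done[key]
        result = fn()
        self.done[key] = result
        with open(self.path, "w") as f:  # checkpoint after every step
            json.dump(self.done, f)
        return result

path = os.path.join(tempfile.mkdtemp(), "run.json")
calls = []
run = DurableRun(path)
run.step("search", lambda: calls.append("search") or "results")
run = DurableRun(path)                   # simulate a process restart
run.step("search", lambda: calls.append("search") or "results")
print(calls)                             # the tool side effect ran only once
```

This is also why the checkpoint boundary should be per node: the key identifies a whole step, not an individual token stream.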
Practice project: Write a mini-harness in ~1,500 lines of Python
A loop wrapping anthropic.messages.create or LiteLLM for model-agnosticism
Tool registry from a Python decorator (@tool) with JSON-schema generation
A CLAUDE.md-style system-prompt loader that reads ./harness/rules/*.md with path-glob matching
A SKILL.md progressive-disclosure loader (aim for under 50 tokens of metadata per skill in context)
A sub-agent spawn primitive with isolated context, returning a summary string back to parent
Filesystem offload: any tool result over 20K tokens is written to ./workspace/<id>.txt and replaced in context with a path plus 10-line preview
Auto-compaction at 85% of context window: summarize messages older than the last 10 turns
A pluggable hook system (pre_tool, post_tool, stop)
OpenTelemetry tracing via opentelemetry-sdk exported to LangSmith or Phoenix (both speak OTEL)
Durable resume: persist message history and state to SQLite after each step, reload by run ID
Optional add-on: wrap the whole thing in Inngest or Temporal so each tool call becomes a durable step
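For the tool-registry piece, a sketch of the @tool decorator: it derives a JSON-schema-style spec from the function signature and docstring. This simplified version handles flat, required, primitive-typed parameters only:

```python
import inspect

TOOL_REGISTRY = {}
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool(fn):
    """Register a function and derive a JSON-schema-style spec from its
    signature and docstring (simplified: flat params, no defaults)."""
    sig = inspect.signature(fn)
    TOOL_REGISTRY[fn.__name__] = {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),  # the model's user manual
        "input_schema": {
            "type": "object",
            "properties": {
                name: {"type": PY_TO_JSON.get(p.annotation, "string")}
                for name, p in sig.parameters.items()
            },
            "required": list(sig.parameters),
        },
    }
    return fn

@tool
def read_file(path: str) -> str:
    """Read a UTF-8 text file and return its contents."""
    with open(path) as f:
        return f.read()

print(TOOL_REGISTRY["read_file"]["input_schema"])
```

Note that the docstring becomes the tool description the model sees, which is exactly why "Writing tools for agents" tells you to treat descriptions as the LLM's user manual.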
Phase 3 Milestone
By the end of this phase you should be able to:
- List the ten components of a modern harness and explain when each pays off
- Write a 1,500-line Python harness with loop, tool dispatch, context compression, sub-agents, hooks, and OTEL traces
- Wire durable execution via Inngest or Temporal so a process kill is recoverable
- Produce a 1,000-word post-mortem comparing your mini-harness to the Claude Agent SDK and Deep Agents. What you got right, what you cut, what you'd do differently
That post-mortem is the real deliverable. The code is just evidence
Phase 4: Build the eval and regression harness (3–4 weeks)
Your goal this phase: Make your agent measurable. Without this, every "improvement" is vibes
This is where most engineers stall. They can build a great agent but can't tell whether their next change made it better or worse
What to learn
1. Pick exactly one observability platform
Don't run two. The five real options:
- LangSmith. Pick if you live in LangGraph or LangChain. Native tracing. March 2026 added Sandboxes, the Polly debugging assistant, Skills, and Fleet (agent identity/sharing)
- Braintrust. Pick if you want framework-agnostic CI quality gates that block PRs. $80M Series B Feb 2026. Flat $249/mo for unlimited users vs LangSmith's $39/seat
- Arize Phoenix (open source) and Arize AX (managed). Pick if you want OpenTelemetry-native, drift detection, and a clean migration path from OSS to managed
- W&B Weave. Pick if you're already on Weights & Biases for ML. Now has full agent trace views, MCP auto-logging, and forthcoming A2A tracing
- Inspect (UK AISI). Pick for benchmark-grade evals. GAIA, SWE-bench, Cybench, BFCL all ship as inspect_evals packages. Used by Anthropic, DeepMind, and Grok internally
Resources:
- LangSmith docs (free tier, official) — Production tracing, online evals, experiments, and the new Polly debugging assistant
- Inspect AI annotated notes by Hamel Husain (free) — Hamel's notes are the practitioner write-up I lean on. Read this before installing Inspect
- Inspect docs (free, official) — The framework reference
- inspect_evals (free, GitHub) — 200+ standard evals as a Python package. GAIA, SWE-bench, Cybench, BFCL
- Braintrust docs (free tier) — Framework-agnostic experiments, CI gates, and golden datasets
- Agent Evaluation Readiness Checklist (LangChain) (free) — 17-minute practical checklist: error analysis, dataset construction, grader design, offline and online evals, production readiness. Print this and tape it to your monitor for the entire phase
- Quantifying infrastructure noise in agentic coding evals (Anthropic) (Feb 05, 2026) (free, official) — Flaky sandboxes and network jitter alone can swing eval scores by several points. Before you trust any agent benchmark number (yours or someone else's), read this
*What to focus on:* trace sampling strategy, online vs offline evals, the difference between a metric and a guardrail, and why CI gating is the pattern that turns evals from dashboard wallpaper into a development tool
2. The four eval types you must implement
Per Anthropic's "Demystifying evals for AI agents":
- Single-turn evals: given this input, is the output right? Cheapest, deterministic graders where possible, run constantly
- Trajectory evals: did the agent call the right sequence of tools with the right arguments? Test single-step, full-turn, and multi-turn variants
- LLM-as-judge: for open-ended outputs (research reports, code review). Calibrate against human-graded examples weekly. Anthropic's research-agent rubric used 0.0–1.0 across factual accuracy, citation quality, completeness, source quality, tool efficiency
- End-state evals: for stateful agents (DB writes, file edits). Compare the final state of the environment to ground truth. This is τ-bench's approach
Resources:
- Demystifying evals for AI agents (Anthropic) (free, official) — Anthropic's best primer on the topic
- Evaluating Deep Agents: Our Learnings (LangChain) (free) — Single-step, full-turn, and multi-turn trajectory eval patterns. The practitioner guide
- How we build evals for Deep Agents (LangChain) (free) — Companion piece. How they actually source data, design metrics, and run well-scoped evals. Pair with the post above
- Eval awareness in Claude Opus 4.6's BrowseComp performance (Anthropic) (Mar 06, 2026) (free, official) — Models can detect when they're being evaluated and behave differently. Read this before designing your eval suite or you'll bake the bias in
- Designing AI-resistant technical evaluations (Anthropic) (Jan 21, 2026) (free, official) — Companion concern: how to design evals that don't get gamed by the very models you're scoring. Required reading if you're rolling your own benchmark
- τ²-bench repository (free, GitHub) — Multi-turn customer-service evals with policy compliance
- Establishing Best Practices for Building Rigorous Agentic Benchmarks (arXiv) (free) — Read this before designing anything original. SWE-bench, KernelBench, and WebArena all overestimate by 5–33%
*What to focus on:* how to write a deterministic grader where you can, how to calibrate an LLM judge against human grades, when pass^k matters more than pass@1, and how to detect and discard contaminated benchmarks
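Three of those ideas fit in a few lines of plain Python. This is a minimal sketch of a deterministic single-turn grader, a trajectory check, and a pass^k metric; the function names and the `(tool_name, args)` trace shape are my own conventions, not any framework's API:

```python
def grade_exact(expected: str, output: str) -> float:
    """Single-turn deterministic grader: cheapest eval type, run constantly."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def grade_trajectory(expected_tools: list, trace: list) -> float:
    """Trajectory eval: did the agent call the right tools in the right order?
    `trace` is a list of (tool_name, args) tuples pulled from your run log."""
    return 1.0 if [name for name, _ in trace] == expected_tools else 0.0

def pass_pow_k(run_scores: list, k: int) -> float:
    """pass^k: the task must succeed on ALL k runs (tau-bench's consistency
    metric), not just one lucky sample like pass@1."""
    return 1.0 if all(s >= 1.0 for s in run_scores[:k]) else 0.0

trace = [("plan", {}), ("search", {"q": "capital of France"}), ("cite", {})]
score = grade_trajectory(["plan", "search", "cite"], trace)
```

LLM-as-judge is the one type you can't write deterministically; everything else should look roughly like this before you reach for a judge.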
Practice project: Build a regression harness around your Phase 2 research agent
Build a golden dataset of 30–50 hand-graded research questions across three difficulty levels (Level 1/2/3, GAIA-style)
Implement deterministic graders where possible (exact-match on factual queries) and an LLM-as-judge scorer with a 5-criterion rubric for open-ended ones
Build a trajectory eval: did the agent plan, spawn ≥2 sub-agents, cite sources, finish under budget?
Wire it into GitHub Actions: every PR runs the full suite. Block merge if golden-set pass rate drops by ≥3 points or any pass^4 metric drops
Add production sampling: 1% of live traces get auto-graded by LLM-as-judge nightly. Alert on drift
Re-run the agent against at least one published benchmark via Inspect: GAIA Level 1 or τ²-bench retail. Compare your numbers to public leaderboards
Ship a make eval target that emits three artifacts: a CI pass/fail summary, a LangSmith experiment URL, and an Inspect log file with one canonical benchmark score
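The CI-gate step above reduces to one comparison. A minimal sketch of the merge-blocking logic, assuming you store the baseline pass rate from main and collect per-question scores on each PR (the `gate` function and its signature are hypothetical, not a Braintrust or LangSmith API):

```python
def gate(baseline_pass_rate: float, current_scores: list, max_drop: float = 3.0) -> int:
    """Return a CI exit code: nonzero blocks the merge, matching the
    '>= 3-point regression on the golden set' rule from the project spec."""
    current = 100.0 * sum(current_scores) / len(current_scores)
    regressed = baseline_pass_rate - current >= max_drop
    return 1 if regressed else 0

# 50-question golden set, 88% baseline on main:
ok = gate(88.0, [1.0] * 44 + [0.0] * 6)       # 44/50 = 88.0, no drop -> 0
blocked = gate(88.0, [1.0] * 42 + [0.0] * 8)  # 42/50 = 84.0, 4-pt drop -> 1
```

In GitHub Actions, run it as the last step of your eval job and let the nonzero exit code fail the check.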
Phase 4 Milestone
By the end of this phase you should be able to:
- Pick one observability platform and defend the choice on architectural grounds
- Implement all four eval types. Single-turn, trajectory, LLM-as-judge, end-state
- Maintain a golden dataset that grows from production failures, not synthetic data
- Block PRs in CI when eval scores regress
- Produce a make eval target that emits a CI pass/fail summary, a LangSmith experiment URL, and an Inspect log file with one canonical benchmark score
- Document the failure modes you found in your own agent. That document is the actual product
Phase 5: Production hardening (ongoing)
Your goal this phase: Take everything you've built and make it survive contact with real users, real cost, and real failures
This is permanent, not a phase you finish
What to learn
1. Cost discipline
Use prompt caching aggressively. Anthropic's caching saves up to 90% on repeated prefixes. Cache your CLAUDE.md, system prompt, and tool definitions
Route by difficulty: Haiku 4.5 or Sonnet 4.6 for simple turns, Opus 4.7 for planning and hard reasoning
The "advisor tool" beta (Anthropic, March 2026) lets you pair an executor model with a stronger advisor model mid-generation
Watch the Opus 4.7 tokenizer: same sticker price as 4.6, but ~1.0–1.35× as many billable tokens for the same text. Re-measure cost-per-task after migrations
Batch API for non-real-time workloads gets 50% off
For multi-agent (Anthropic-style research): expect ~15× the tokens of single-agent chat. Only run multi-agent when the answer's value clears that bar
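The routing rule above is simple enough to sketch. The model names come from this roadmap's own recommendations (Haiku 4.5, Sonnet 4.6, Opus 4.7); the routing heuristics and the `turn` dict shape are hypothetical, so tune them against your own cost-per-task data:

```python
def route_model(turn: dict) -> str:
    """Hypothetical difficulty router: cheap models for simple turns,
    the frontier model only for planning and hard reasoning."""
    if turn.get("kind") in {"plan", "debug", "architecture"}:
        return "claude-opus-4-7"   # expensive: reserve for the hard steps
    if len(turn.get("content", "")) < 500 and not turn.get("tools"):
        return "claude-haiku-4-5"  # short, tool-free turns go to the cheapest tier
    return "claude-sonnet-4-6"     # sensible default for everything else

cheap = route_model({"kind": "chat", "content": "summarize this paragraph"})
hard = route_model({"kind": "plan", "content": "design the migration"})
```

Pair this with prompt caching on the stable prefix (system prompt, CLAUDE.md, tool definitions) so the router only changes the model, never invalidates the cache boundary.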
Resources:
- Open Models have crossed a threshold (LangChain) (free) — GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks (file ops, tool use, instruction following). Read this before locking your model selection and routing strategy
*What to focus on:* prompt caching boundaries, model routing rules, batch vs real-time decisions, and a hard cost-per-task budget you monitor
2. Latency
Parallel tool calls. Anthropic's research-system prompt literally says "you MUST use parallel tool calls when creating multiple sub-agents." Same applies to your own agents
Streaming partial outputs to UI via LangGraph's stream_mode="updates"
Sub-agent fan-out is the single biggest latency lever: a 60-step sequential agent becomes a 10-step lead plus 5 parallel 10-step sub-agents
*What to focus on:* where parallelism is safe, where streaming changes the UX, and how fan-out interacts with cost
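The fan-out lever above is just `asyncio.gather` at heart. A minimal sketch where `sub_agent` is a stand-in for a real sub-agent call (network I/O in practice); wall-clock time is roughly one sub-agent's latency instead of the sum:

```python
import asyncio

async def sub_agent(task: str) -> str:
    """Stand-in for a real sub-agent invocation (an LLM call in practice)."""
    await asyncio.sleep(0.1)  # pretend each sub-agent takes ~100ms
    return f"findings for {task}"

async def lead_agent(tasks: list) -> list:
    # Fan out: all sub-agents run concurrently, and gather preserves
    # the input order, so the lead can merge results deterministically
    return await asyncio.gather(*(sub_agent(t) for t in tasks))

results = asyncio.run(lead_agent(["pricing", "competitors", "reviews"]))
```

The caveat the "What to focus on" line points at: only fan out tasks with no shared mutable state, or you trade latency for race conditions.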
3. Safety and sandboxing
All code execution in a sandbox: Modal, E2B, Daytona, or LangSmith Sandboxes (private preview, March 2026). Never exec() model output in your main process
Credentials brokered outside the model context (Anthropic Managed Agents pattern; Composio handles this for SaaS auth)
Hooks for guardrails: PreToolUse hooks that block destructive Bash, regex-block secrets, validate file-write paths
Human-in-the-loop interrupts on any irreversible action (LangGraph's interrupt() plus HumanInTheLoopMiddleware, Claude Agent SDK's permission prompts)
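The guardrail-hook pattern above can be sketched framework-free. The `(allow, reason)` return shape and the regexes are my own illustration, mirroring the PreToolUse idea rather than any SDK's exact hook signature:

```python
import re

DESTRUCTIVE = re.compile(r"\brm\s+-rf\b|\bdrop\s+table\b|\bmkfs\b", re.I)
SECRET = re.compile(r"(api[_-]?key|secret|token)\s*=", re.I)

def pre_tool_use(tool_name: str, tool_input: dict):
    """Hypothetical PreToolUse hook: runs before every tool call,
    returning (allow, reason). Deny by pattern, not by model judgment."""
    if tool_name == "bash" and DESTRUCTIVE.search(tool_input.get("command", "")):
        return False, "blocked: destructive shell command"
    if tool_name == "write_file" and SECRET.search(tool_input.get("content", "")):
        return False, "blocked: secret-looking string in file write"
    if tool_name == "write_file" and not tool_input.get("path", "").startswith("workspace/"):
        return False, "blocked: write outside workspace/"
    return True, "ok"

verdict = pre_tool_use("bash", {"command": "rm -rf /"})
```

Hooks like this are deterministic code, which is the point: the model can be persuaded, a regex can't.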
Resources:
- Modal docs (free tier) — The default sandbox for Python code execution
- E2B docs (free tier) — Code-execution sandboxes designed for AI agents
- Beyond permission prompts: making Claude Code more secure and autonomous (Anthropic) (Oct 20, 2025) (free, official) — The foundational sandboxing post. How Claude Code stops asking permission for safe actions and contains the unsafe ones. The pattern your harness should copy
- Claude Code auto mode: a safer way to skip permissions (Anthropic) (Mar 25, 2026) (free, official) — The follow-up. What changes when you let the agent run unattended. Read both before you flip any "skip confirmation" flag in production
*What to focus on:* which actions are reversible, which require human approval, and how to make sure the model never sees the credential it uses
4. Monitoring and drift
100% trace sampling at low scale; downsample to 1–10% with stratified sampling on errors at high scale
Alerts on: token cost per request, tool-call failure rate, LLM-as-judge mean score (nightly), p95 latency, eval regression
Re-baseline evals after every model upgrade
Anthropic's own engineering blog warns: "harnesses encode assumptions about what Claude can't do on its own; those assumptions go stale as models improve" (the "context anxiety" example with Sonnet 4.5 → Opus 4.5)
*What to focus on:* what to alert on vs what to log, how to detect prompt-cache invalidation, and how to spot harness ossification when the model has moved past it
5. Resilience
Durable execution (Inngest, Temporal, or LangGraph PostgresSaver) is non-negotiable for any agent that runs over 60 seconds
Checkpoint after every node. Rewind and fork should be possible. Pydantic Deep Agents and LangGraph both support this. The Claude Agent SDK's session log is equivalent
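The checkpoint-rewind-fork idea is worth internalizing before you adopt a library. A minimal in-memory sketch of the pattern behind PostgresSaver or a durable step log (the `Checkpointer` class is illustrative, not any framework's actual API):

```python
import copy

class Checkpointer:
    """Minimal checkpoint/rewind sketch: one snapshot per completed node."""

    def __init__(self):
        self.steps = []

    def save(self, state: dict):
        # Deep-copy so later mutations can't corrupt earlier snapshots
        self.steps.append(copy.deepcopy(state))

    def rewind(self, to_step: int) -> dict:
        # Fork from an earlier step: drop later snapshots, return that state
        self.steps = self.steps[: to_step + 1]
        return copy.deepcopy(self.steps[to_step])

cp = Checkpointer()
cp.save({"node": "plan", "notes": []})
cp.save({"node": "search", "notes": ["found 3 sources"]})
state = cp.rewind(0)  # process died mid-search? resume from "plan"
```

A real implementation persists each snapshot durably instead of keeping it in memory, but the contract, snapshot after every node and fork from any snapshot, is the same.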
Resources:
- How My Agents Self-Heal in Production (LangChain) (free) — A working pipeline that detects regressions after every deploy, triages the cause, and opens a fix PR with no human in the loop until review. Steal the pattern
*What to focus on:* which kinds of failures you can recover from automatically, which need human escalation, and how to test your resume path before production traffic forces the issue
Phase 5 Milestone
This phase doesn't end. But you should have:
- Prompt caching wired across system prompt, CLAUDE.md, and tool definitions
- A model-routing layer with hard cost-per-task budgets and alerting
- A sandbox for all code execution and a credential broker keeping secrets out of context
- Hooks blocking destructive actions and forcing human approval on irreversible ones
- Trace sampling, drift alerts, and a re-baselining ritual on every model upgrade
- A durable-execution layer so a process kill is a non-event
Recommendations
*Decision-ready takes you can act on today*
If you only learn one framework: LangGraph 1.0 + Deep Agents
It's the most general option, its runtime story is the most mature today (PostgresSaver, time-travel debugging, durable execution, OTEL-friendly observability via LangSmith), it's model-agnostic, and its core abstraction (state graph plus middleware) is a generalizable mental model
Full stop
If you only learn one harness as a reference: Claude Agent SDK plus Claude Code
It is the reference example. CLAUDE.md, Skills, sub-agents, hooks, plan mode, the filesystem-as-memory pattern. Every other harness in 2026 is converging on these primitives
Use Claude Code daily, read its docs, study the open-source harness compendiums
If you only read one thing on context
Anthropic's "Effective context engineering for AI agents" (Sep 2025)
If you only read two: add LangChain's "Context Engineering for Agents" for the Write/Select/Compress/Isolate framework
If you only learn one observability tool
LangSmith if you're staying on LangGraph
Braintrust if you want framework-agnostic CI gating
Inspect if you want benchmark-grade rigor (and you should, eventually)
Skip in 2026
AutoGen v0.4 (merged into Microsoft Agent Framework; the community lineage is AG2, and neither is a strong default)
OpenAI Swarm (officially superseded, explicitly "not production-ready" per OpenAI's own README)
The Assistants API (sunsetting mid-2026)
Building your own vector store or memory before you've measured an actual recall problem
"No-code" agent platforms unless you're building something throwaway
Use only when you have a specific reason
CrewAI. Fastest idea-to-prototype, fragile in production. Use for hackathons and demos
OpenAI Agents SDK. Fine if you're OpenAI-locked. The April 2026 update added sandboxing and harness features, but you're still tied to OpenAI models
Pydantic AI / Pydantic Deep Agents. Pick if you're a strict-types FastAPI shop
Mastra. Pick only if your team is TypeScript and can't use Python. v1.0 Jan 2026, YC W25, 22k+ stars, built by the Gatsby team
Smolagents. Best teaching tool for code-agent patterns (its 1,000-line codebase is hackable). Production-weak
DSPy 3.0 + GEPA. When you have a metric and want to programmatically optimize prompts and agent topology. GEPA outperforms RL by 6% with 35× fewer rollouts (ICLR 2026 oral)
Letta / MemGPT. If you need OS-style agent self-managed memory across sessions. Otherwise filesystem plus Mem0 is simpler
Benchmarks to bookmark (May 2026 numbers)
SWE-bench Verified: Claude Opus 4.7 ≈ 87.6%, GPT-5.5 ≈ 88.7%, Gemini 3.1 Pro ≈ 78.8%
Terminal-Bench 2.0: GPT-5.5 82.7%, Opus 4.7 ~70%, Gemini 3.1 Pro ~68%
τ-bench: Claude Mythos Preview 89.2% leads
BrowseComp: GPT-5.5 90.1%, Gemini 3.1 Pro 85.9%, Opus 4.7 79.3% (a regression from 4.6's 83.7%. Route web research to GPT-5.5)
GAIA / Princeton HAL: Sonnet 4.5 leads at 74.6%
Time-boxed milestones for a technically strong engineer new to agents
Week 2: Phase 0 done. You can explain a harness in plain English
Week 5: Phase 1 done. Claude Agent SDK agent shipped with one Skill, one hook, one sub-agent
Week 9: Phase 2 done. LangGraph deep-agent research analyst running with PostgresSaver durability and LangSmith traces
Week 13: Phase 3 done. 1,500-line mini-harness, written and documented, comparable in capabilities to a stripped Claude Agent SDK
Week 17: Phase 4 done. Golden datasets, CI gates, one published-benchmark run via Inspect
Forever: Phase 5
If you're moonlighting at 10–15 hours/week, multiply by ~2.5×
The benchmarks that change the plan: if you can't get Phase 1 working in 3 weeks, your tool design is wrong (re-read "Writing tools for agents"). If Phase 2 takes more than 5 weeks, you're trying to build the harness too. Drop down to Deep Agents and stop fighting it
Caveats
*Things that will trip you up if you don't see them coming*
Benchmarks are moving targets and partially gamed
SWE-bench Verified scores went from 1.96% to 80%+ in two years
τ-bench's pass^k consistency metric was added precisely because single-run accuracy stopped being informative
Treat any "X model scored Y%" claim as a joint property of the model, the harness, the scaffold, the retry budget, and the system prompt. Never the model alone
Multi-agent is overhyped for most use cases
The 90.2% improvement Anthropic reported is for breadth-first research specifically
For coding and tightly coupled tasks, multi-agent often performs worse than single-agent and burns 15× the tokens
Default to single-agent plus sub-agents for scoped exploration. Reach for full multi-agent only when the task decomposes naturally
The counter-example to bookmark: Anthropic's "Building a C compiler with a team of parallel Claudes" (Feb 05, 2026) at https://anthropic.com/engineering/building-c-compiler shows a coding task where parallel sub-agents did pay off. Multi-agent isn't dead for code, it just needs the right decomposition
Speculation flags in 2026 sources
Several "AI 2027" projections (OpenBrain $45B revenue, etc.) are explicitly fictional but get cited as stats. Ignore them
Launch-week reception articles are anecdotal. Treat them as signal about developer sentiment, not as benchmarks
The framework landscape can shift again
LangChain's own framing has moved twice in 18 months (chains → graphs → harnesses-on-graphs)
Any of Pydantic AI, Mastra, or Deep Agents could be much bigger in 12 months
Bet on the abstractions (loop, tools, context, sub-agents, durability, traces) more than any one library. Those carry over
MCP's production rough edges are real
Streamable HTTP behind load balancers, multi-tenant auth, rate limiting, audit logging. All are explicitly on the 2026 MCP roadmap, meaning they're not solved yet
Plan for the next-gen transport SEPs landing in late 2026 and don't deeply couple to the current session model
Model-specific behavior changes between point releases
Opus 4.7's stricter instruction-following and new tokenizer mean your Opus 4.6 prompts may behave differently and cost up to 35% more in tokens for the same text
Re-replay traffic on every model bump
Your eval suite will rot
A golden dataset built today will saturate within months as models improve
Plan to grow it 10–20% per quarter from production failures, not from synthetic data
Keep human calibration on LLM-as-judge running indefinitely
Some sources in this roadmap are vendor-marketed
Lean on primary sources (Anthropic engineering blog, LangChain blog, OpenAI announcements, arXiv) where possible
The ranking-style "best of 2026" posts (Alice Labs, Channel.tel, GuruSup, Morph, Vstorm) are useful triangulation but each has commercial incentives
Where they agree with each other and with primary engineering sources, treat the consensus as reliable
Conclusion
What can you expect after working through this roadmap?
I'm going to be honest with you, with no sugarcoating
This roadmap will not make you a principal AI engineer in 17 weeks
But it will make you someone who can build and ship agent systems that survive production traffic
That happens to be the thing companies are paying for right now
The demand for engineers who can ship production agents is not slowing down
57% of teams in the LangChain State of Agent Engineering report already have agents in production, and 89% of those have observability wired
Quality is the #1 barrier (32%), which means the entire field is bottlenecked on engineers who can build evals and harnesses, not on engineers who can call an LLM API
Anthropic's own number captures the real opportunity: same model, different harness, 78% vs 42% on CORE
That gap is your job
The harness-engineering shift is the largest mispricing in software hiring right now
Companies still post "prompt engineer" roles
What they need is engineers who can take a frontier model and turn it into a production system that is measurable and durable
Now here is what I want you to take away from all of this:
- Pick one project from each phase and build it. Not read about it. Build it, break it, fix it, deploy it, then put a LangSmith trace and a benchmark score in your README. The engineers who get hired are the ones who can show a trace, not the ones who can recite a framework comparison table
- Start sharing what you learn. Write up your mini-harness post-mortem. Publish your golden-dataset findings. Post your benchmark numbers with the harness configuration that produced them. Teaching is the fastest way to learn and it builds your reputation at the same time. The best opportunities come from engineers who are visible, not from engineers who applied to 500 listings
- And please don't wait until you feel ready. You will never feel ready. The gap between "I'm reading the LangChain blog" and "I'm shipping a deep agent with PostgresSaver durability" is where most engineers get stuck forever
- Start applying, start building in public, start shipping the moment you have a working agent. Even if it's small. The market doesn't reward perfection. It rewards engineers who can make the model do something real and prove it didn't regress
17 weeks is enough to change everything if you put in the work
And I believe each of you reading this can do it
Just keep building and keep measuring what you build
*This article was written by the author from internal notes compiled over 2 to 3 months, and edited by Minimax 2.7. I have a content pipeline in Obsidian that powers these pieces and writes in my style from my handwritten and hand-typed notes.*