
How to Become an AI Engineer in 2026 (Builder's Roadmap)


Avid · @Av1dlive · May 6


You can build the next $1B company in a weekend using AI

## And the only skill you need to learn is agentic AI & harness engineering

TL;DR: if you don't want to read 7,862 words of roadmap, you can simply give this link to your agent to personalise a roadmap ➡️ https://raw.githubusercontent.com/codejunkie99/agent-roadmap-2026/main/AGENT.md

The problem is that most engineers have no clear idea what they should learn

Some pick CrewAI because the role-based demos look slick on Twitter

Some chase every new framework that ships and never finish anything real

Others jump straight into multi-agent systems without understanding context, tools, harnesses, or evals

The result is usually the same: a lot of framework tourism and very little production-ready skill

If your goal is to become an agent engineer in 2026, you don't need to learn 12 frameworks

You need to learn how to build, harness, evaluate, and ship real agent systems in production

That means learning how to:

  • build agents on a real orchestration runtime like LangGraph
  • work with the Claude Agent SDK as a reference harness
  • engineer context properly with Write, Select, Compress, Isolate
  • write tools the model picks correctly
  • add memory, durability, and sandboxing for production traffic
  • build evals, trajectory checks, and CI regression gates
  • ship agents that survive contact with real users and real cost

This guide is a 6-phase roadmap built on what shipped in late 2025 and early 2026

The piece is 7,000+ WORDS and pulls from primary sources only

But its real value is that every phase has a concrete project, a canonical reading list, and the exact resources you need

That way, within roughly 17 WEEKS of focused work, you can reach the level of agent engineer who can own a production AI feature end to end

Researching this took more than 60 HOURS of reading primary engineering blogs, papers, and shipping-engineer surveys

Now let's start reading the roadmap ⬇️

What an Agent Engineer does in 2026

A lot of people hear "AI agent engineer" and imagine someone gluing together CrewAI roles and calling it shipped

In reality, most modern agent engineers do something much more practical

They build, harness, and operate agent systems on top of frontier models

That usually includes:

  • designing the agent loop and tool dispatch
  • engineering context with Write, Select, Compress, Isolate
  • writing tools the model selects correctly
  • orchestrating sub-agents with isolated context windows
  • adding skills, memory, durability, and sandboxing
  • wiring evals, traces, and CI gates so "better" becomes measurable

Same model, different harness, completely different result

Anthropic's own measurement: Opus 4.5 scored 78% on CORE inside Claude Code, and 42% inside Smolagents

Same model. Full stop

That gap is harness engineering, which is what this roadmap is about

The four context primitives every agent builder needs to know: Write (scratchpads, memory files), Select (retrieval at the point of use), Compress (summarization at 85–95% of the context window), Isolate (sub-agents with their own context windows)

Anthropic's multi-agent research system beat single-agent Opus 4 by 90.2% on breadth-first research using exactly this pattern, while burning ~15× the tokens

In practice there are only two stacks worth learning deeply in 2026: LangGraph 1.0 + Deep Agents, and the Claude Agent SDK

The rest are either fading out, getting absorbed, or worse versions of these two for production

Free resources to follow throughout the roadmap

These are the blogs, courses, channels, and newsletters that ship signal for free

Subscribe to them in Phase 0 so the rest of the roadmap lands on a steady drip of new posts, case studies, and primary-source updates

None of these are paywalled, and most of them update faster than any textbook ever could

Engineering blogs to subscribe to

Resources:

  • Anthropic engineering blog (free, official) — If you read one blog, read this one. Context engineering, harness design, multi-agent research, advanced tool use, evals. All primary sources, all referenced repeatedly across this roadmap
  • LangChain blog (free) — Where the harness, middleware, and Deep Agents discipline get formalized in public. Read everything by Lance Martin, Vivek Trivedy, and Harrison Chase
  • OpenAI Cookbook (free, GitHub) — Working notebooks for every API feature. Tool use, structured outputs, evals, agents. Type along
  • Hamel Husain's blog (free) — "Your AI Product Needs Evals" is the eval essay everyone links to. Everything else on the site is in the same league. If you build evals, read this twice
  • Eugene Yan's blog (free) — "Patterns for Building LLM-based Systems & Products" is the practitioner write-up everyone references. Opinionated and calibrated against real shipping experience
  • Lilian Weng's blog (free) — Long-form deep dives on agents, prompt engineering, hallucination, alignment. The clearest synthesis writing in the field
  • Simon Willison's blog (free) — Daily notes from a senior engineer who ships. Good for sanity-checking hype and catching weird edge cases first
  • Chip Huyen's blog (free) — ML systems from first principles. Her "Building LLM applications for production" piece is required reading before Phase 5
  • Phil Schmid's blog (free) — Practical end-to-end guides on HuggingFace, Gemini, fine-tuning, deployment. Always shows the code
  • Cameron Wolfe writes Deep (Learning) Focus (free) — Long-form paper breakdowns. Catch up on a research area in one read

Free courses worth completing

Resources:

  • DeepLearning.AI Short Courses (free) — Short 1–2 hour courses, almost all free. The LangGraph course (built with LangChain) and Andrew Ng's "Agentic AI" course (Reflection, Tool Use, Planning, Multi-Agent design patterns) are the two to complete in Phase 0
  • LangChain Academy: Introduction to LangGraph (free) — The official free course. State, memory, human-in-the-loop, multi-agent. Do this in Phase 2
  • Anthropic Interactive Prompt Engineering Tutorial (free, GitHub) — Nine chapters as Jupyter notebooks against the Claude API. The fastest way to build prompting muscle
  • HuggingFace Agents Course (free) — End-to-end coverage of agents, smolagents, MCP, and evaluation. Free certificate
  • HuggingFace LLM Course (free) — Foundations: tokenization, transformers, fine-tuning. Useful background even if you only build on APIs
  • MCP Fundamentals on FreeAcademy (free) — Build MCP servers, connect them to Claude, write custom tools. The fastest path to MCP literacy

YouTube channels and talks

Resources:

  • Andrej Karpathy (free) — Neural Networks: Zero to Hero builds GPT from scratch in raw Python. His 2026 "Vibe Coding to Agentic Engineering" talk at Sequoia AI Ascent is the clearest take on why harness engineering matters now
  • AI Engineer (free) — All AI Engineer Summit and World's Fair talks. Search for talks by Hamel Husain, swyx, Anthropic engineers, and Erik Schluntz
  • LangChain (free) — Weekly tutorials on LangGraph, Deep Agents, middleware, and integrations. Often the first place new features land in video form
  • Anthropic (free) — Talks from Anthropic engineers. Multi-agent research walkthroughs, Claude Code internals, Skills
  • Yannic Kilcher (free) — Paper breakdowns. Saves you reading every arXiv preprint yourself
  • Lex Fridman Podcast on YouTube (free) — Long-form interviews with the people building and researching AI. Karpathy, Schulman, Sutskever, Amodei

Newsletters worth subscribing to

Resources:

  • Latent Space by swyx and Alessio (free) — The technical newsletter for AI engineers. AINews daily roundup, podcast, and the annual "AI Engineering Reading List". If you only subscribe to one, this is it
  • The Batch by Andrew Ng (free) — Weekly broad-spectrum coverage. Good for noticing when something new is breaking out
  • Import AI by Jack Clark, Anthropic co-founder (free) — Policy plus research roundup. Closest thing to a strategic context briefing for the field
  • Ben's Bites (free) — Daily AI news in five minutes. Skim only. Useful for catching announcements you'd otherwise miss
  • TLDR AI (free) — Daily digest, low-noise. Pair with one of the deeper newsletters above
  • AI Engineer Pack by swyx (free) — Curated free credits, tools, and resources for AI engineers. Updated continuously

Open-source repos worth studying

Resources:

  • Anthropic Cookbook (free, GitHub) — Reference implementations of every workflow pattern. Already on the Phase 0 list. Re-read it after each phase
  • OpenAI Cookbook (free, GitHub) — Same idea, OpenAI side. Tool use, structured outputs, evals, agents
  • deepagents by LangChain (free, GitHub) — The reference open-source harness on top of LangGraph. Read the middleware files when you build your own harness in Phase 3
  • LangGraph examples (free, GitHub) — Runnable LangGraph patterns. Supervisor, hierarchical teams, planning, customer support agent
  • inspect_evals (free, GitHub) — 200+ standard evals as a Python package. GAIA, SWE-bench, Cybench, BFCL
  • awesome-agentic-engineering-resources (free, GitHub) — Community-curated index of agent engineering resources. Use to fill gaps this roadmap doesn't cover

Podcasts for the commute

Resources:

  • Latent Space (free) — Long-form interviews with the people shipping the field. Anthropic, OpenAI, LangChain, Modal, E2B all on the guest list
  • Dwarkesh Podcast (free) — Long interviews on AI strategy, capability, and policy. Long-form, primary sources
  • The TWIML AI Podcast by Sam Charrington (free) — Weekly technical interviews with researchers and engineers
  • Practical AI (free) — Engineering-focused. Less hype, more shipping
  • The MAD Podcast by Matt Turck (free) — Founder plus investor lens on the data and AI ecosystem. Useful for tracking who is shipping vs raising

Communities worth joining

Resources:

  • LangChain Discord (free) — Where you'll find the LangGraph and Deep Agents core team. Active #help channels
  • HuggingFace Discord (free) — Largest open-weights and ML community
  • r/LocalLLaMA (free) — Open-weights model news, benchmarks, and tooling. Often faster than the official channels
  • AI Engineer World's Fair (free with signup) — The professional network of the field. Job postings, hiring channels, working groups
  • Anthropic Discord (free) — Claude developer community. Skills sharing, hooks patterns, MCP servers

*What to focus on:* pick one blog, one newsletter, one podcast, and one community in Phase 0. Don't try to follow all 40+ resources at once

Add more only when the existing ones stop surprising you

The point of this list is breadth so you can choose, not a checklist to complete

Phase 0: Foundations (1–2 weeks)

Your goal this phase: Build correct mental models. Don't write a single line of agent code yet beyond throwaway scripts

Most beginners skip this phase, dive straight into framework tutorials, and end up with code they can't reason about when it fails. Don't skip it

What to learn

1. The augmented LLM and the workflow vs agent distinction

Before you touch a framework, you need to understand the five workflow patterns Anthropic identified (prompt chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer) and why a workflow is not the same thing as an agent

A workflow has a fixed control flow you wrote

An agent makes its own control-flow decisions inside a loop

This distinction will save you from building agents that should have been chains
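To make the distinction concrete, here is a minimal sketch. The `call_model` stub is hypothetical and stands in for a real LLM call; the point is that the workflow's control flow is fixed by you, while the agent lets the model's output decide the next step:

```python
# Illustrative sketch only: `call_model` is a stub standing in for a real LLM call.

def call_model(prompt: str) -> str:
    """Stub model: pretend the model asks for a tool, then finishes."""
    return "TOOL:search" if "TOOL" not in prompt else "DONE: answer"

def workflow(question: str) -> str:
    # Workflow: YOU wrote the control flow. It always runs the same two steps.
    draft = call_model(f"Draft an answer to: {question}")
    return call_model(f"Refine this draft: {draft}")

def agent(question: str, max_steps: int = 5) -> str:
    # Agent: the MODEL decides, inside a loop, whether to call a tool or stop.
    context = question
    for _ in range(max_steps):
        decision = call_model(context)
        if decision.startswith("DONE"):
            return decision
        context += f"\n[ran tool: {decision}]"  # append tool result, loop again
    return "gave up"
```

The workflow is cheaper and more predictable; the agent is what you reach for when you genuinely can't write the control flow in advance.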

Resources:

  • Building Effective Agents by Anthropic (Erik Schluntz and Barry Zhang) (Dec 2024) (free, official) — The five workflow patterns plus the augmented-LLM concept. Everyone in the field cites this. Read it first
  • Anthropic Cookbook (patterns/agents folder) (free, GitHub) — Reference implementations of every workflow pattern as runnable notebooks. Type along, don't just read
  • Simon Willison's annotations of Building Effective Agents (free) — A senior engineer's sanity-check perspective on the same paper

*What to focus on:* the difference between a workflow and an agent, the augmented-LLM mental model, the orchestrator-worker pattern, why parallelization usually beats sequential reasoning, and the failure modes Anthropic explicitly warns about

2. Context engineering as a discipline

Prompt engineering is dead as a standalone skill in 2026. The replacement is context engineering: deciding what tokens are in front of the model at every step of the loop

Resources:

  • Effective context engineering for AI agents by Anthropic (Sep 29, 2025) (free, official) — Read this one twice. Memorize the framing
  • Context Engineering for Agents by Lance Martin (LangChain) (free) — The Write, Select, Compress, Isolate framework. The one mental model you need
  • How we built our multi-agent research system by Anthropic (Jun 2025) (free, official) — The orchestrator-worker reference architecture, the 90.2% breadth-first research improvement, and the 15× token caveat
  • Simon Willison's annotations of the multi-agent research post (free) — Sanity-check perspective on the architecture and the cost trade-offs

*What to focus on:* what each of Write, Select, Compress, and Isolate means in code, why sub-agents are an isolation primitive (not a parallelism primitive), and when you would use compaction vs offloading vs summarization
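As a rough sketch of the Compress primitive: trigger summarization when the history approaches ~85% of the context window. The 200K window, the chars/4 token estimate, and the `keep_last` cutoff below are assumptions for illustration; a real harness would use the model's tokenizer and an LLM summarization call:

```python
# Hypothetical Compress sketch: compact history at ~85% of the context window.

CONTEXT_WINDOW = 200_000   # assumed window size, in tokens
COMPACT_AT = 0.85          # compaction trigger threshold

def estimate_tokens(messages: list[str]) -> int:
    # Crude chars/4 estimate; swap in a real tokenizer in production
    return sum(len(m) for m in messages) // 4

def maybe_compact(messages: list[str], keep_last: int = 10) -> list[str]:
    if estimate_tokens(messages) < COMPACT_AT * CONTEXT_WINDOW:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    # Stand-in for an LLM summarization call over the older messages
    summary = f"[summary of {len(old)} earlier messages]"
    return [summary] + recent
```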

3. The harness as an operating system

A clean walkthrough of what "harness" means

Resources:

  • The Complete Guide to Harness Engineering (ClaudeCodeLab) (free) — Three-level harness escalation with runnable code
  • Inside the Claude Agents SDK (ML6) (free) — The CPU/RAM/OS/App analogy plus the 78% vs 42% Opus 4.5 number that motivates this whole roadmap
  • Building agents with the Claude Agent SDK (Anthropic) (free, official) — Why the SDK exists, why it was renamed from Claude Code SDK
  • Effective harnesses for long-running agents by Anthropic (Nov 26, 2025) (free, official) — Anthropic's own harness primer. Read alongside Vivek Trivedy's posts to triangulate the same ideas from a different team
  • Harness design for long-running application development by Anthropic (Mar 24, 2026) (free, official) — The follow-up. What changes when sessions stretch to hours and days. Phase 3 essential reading too
  • How to think about agent frameworks by Harrison Chase (LangChain) (free) — The orchestration framework vs abstraction distinction. Required before you pick anything

*What to focus on:* the loop, tool dispatch, context curation, persistence, hooks, sub-agent orchestration, observability. And how each of those gets implemented in any harness you'll meet

4. The 2026 state of the field

Resources:

  • State of Agent Engineering (LangChain) (free) — 1,340 respondents, Nov–Dec 2025. Get the numbers in your head: 57% of teams in production, 89% have observability, 52% have evals, quality (32%) is the #1 barrier
  • How to Build an Agent (LangChain) (free) — The "smart intern" framing for scoping what an agent should and should not own
  • Continual learning for AI agents by Harrison Chase (LangChain) (free) — Three layers where agents actually learn: weights, prompts, memory. The framing you need before you reach for fine-tuning anything

*What to focus on:* where teams struggle in production (quality, cost, reliability), what the median stack looks like, and where the marginal hour of effort pays off

Practice project: Write a 2-page personal doc, by hand, that defines in your own words: workflow vs agent, augmented LLM, the four context-engineering primitives, the orchestrator-worker pattern, the difference between harness, model, and framework, and the top three failure modes you expect to see in your own code

This document is the actual deliverable

If you can't write it without looking, you haven't read carefully enough

Phase 0 Milestone

By the end of this phase you should be able to:

  • Explain what an agent is and how it differs from a workflow without using framework jargon
  • Name the four context-engineering primitives and give a code-level example of each
  • Explain why the harness contributes more than the model in 2026
  • Describe the orchestrator-worker pattern and the 15× token cost trade-off
  • Pick a framework on architectural grounds, not vibes

Phase 1: Build your first simple agent (2–3 weeks)

Your goal this phase: Write a tool-using agent twice. Once with Anthropic's raw SDK, once with the Claude Agent SDK harness. Feel the difference between rolling your own loop and standing on a real harness

This is the cheapest possible way to understand what a harness gives you

What to learn

1. The agent loop from scratch

The loop is not magic. You call the model with messages and tools, you parse out tool_use blocks, you execute the tools, you append tool_result, you loop until stop_reason equals end_turn

Once you've written this in ~100 lines yourself, every framework becomes readable
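A minimal sketch of that loop follows. The message shapes mirror the Anthropic Messages API (tool_use and tool_result blocks, stop_reason), but `fake_model` is a stub so the example runs offline; swap it for a real `anthropic.messages.create` call:

```python
# Sketch of the agent loop with a stubbed model. `fake_model` is hypothetical.

TOOLS = {"add": lambda a, b: a + b}

def fake_model(messages):
    # First turn: ask for a tool. Once a tool_result is present, finish.
    if not any(m["role"] == "user" and isinstance(m["content"], list) for m in messages):
        return {"stop_reason": "tool_use",
                "content": [{"type": "tool_use", "id": "t1", "name": "add",
                             "input": {"a": 2, "b": 3}}]}
    return {"stop_reason": "end_turn",
            "content": [{"type": "text", "text": "The answer is 5"}]}

def run_agent(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:
        response = fake_model(messages)
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":
            return response["content"][0]["text"]
        # Execute every tool_use block and feed back tool_result blocks
        results = [{"type": "tool_result", "tool_use_id": b["id"],
                    "content": str(TOOLS[b["name"]](**b["input"]))}
                   for b in response["content"] if b["type"] == "tool_use"]
        messages.append({"role": "user", "content": results})
```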

Resources:

  • Tutorial: Build a tool-using agent (Anthropic docs) (free, official) — The reference for tool_use, tool_result, parallel tool calls, and the response loop
  • Writing tools for agents (Anthropic) (free, official) — Read this before you design any tool. The descriptions for your tools and their parameters are the user manual for the LLM
  • Equipping agents for the real world with Agent Skills (Anthropic) (free, official) — The progressive-disclosure pattern explained by the team that wrote the spec

*What to focus on:* how the request/response loop terminates, what stop_reason values mean, how parallel tool calls are encoded, error recovery when a tool throws, and how to design a tool description so the model picks it correctly

Practice: Build a "from scratch" agent in 100 lines using anthropic.messages.create with a tool spec. Three tools: web_search via Tavily or Firecrawl, read_file, write_file. No framework. Run it on a research task and read every step of the trace

2. The Claude Agent SDK as the canonical harness

The Claude Agent SDK is the same harness that powers Claude Code

You will study it as a reference and use it as your day-1 tool

Resources:

  • Claude Agent SDK docs (free, official) — The Python and TypeScript SDKs, hooks, sub-agents, skills, and the Task tool
  • Claude Agent SDK, Skills reference (free, official) — How SKILL.md files work, the metadata frontmatter, progressive loading
  • claude-code-best-practices by Muhammad Usman GM (free, GitHub) — Skim, don't copy wholesale. Useful for seeing what real users do
  • claude-code-best-practice by Shan Raisshan (free, GitHub) — Companion compendium with a different curation slant
  • Evaluating Skills (LangChain) (free) — How LangChain measures whether a Skill is actually pulling its weight. Useful once you've written your first Skill in this phase and want to know if it's helping or hurting

*What to focus on:* the CLAUDE.md system-prompt pattern, how Skills are loaded progressively, the PreToolUse and PostToolUse hooks, spawning sub-agents via the Task tool, and how the SDK handles permission prompts

Practice: Rebuild the same agent from the previous topic using claude-agent-sdk. Add a CLAUDE.md with project conventions. Add one Skill (folder with SKILL.md) that defines a "research-summary" output format. Add one PostToolUse hook that auto-formats any file the agent writes. Spawn one sub-agent for a sub-task using the Task tool
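For orientation, a hypothetical SKILL.md for that "research-summary" Skill might look like this. The `name` and `description` frontmatter fields follow Anthropic's Skills reference; the body content is invented for illustration:

```markdown
---
name: research-summary
description: Use when the user asks for a research summary. Defines the required output format.
---

# Research summary format

Every summary must contain:

1. A one-paragraph abstract
2. Key findings as bullets, each with an inline citation
3. A "Sources" section listing every URL consulted
```

Only the frontmatter metadata sits in context by default; the body loads when the Skill fires. That's the progressive-disclosure pattern in miniature.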

3. Ship something tiny

Tutorials don't count. You need a thing that runs on a schedule and that you read the output of

Practice project: A daily-briefing agent that reads your local Markdown notes and a couple of RSS feeds, produces a summarized briefing with citations, and writes it to disk. Cron it via launchd or systemd. Run it for a week. Watch it fail. Fix it

Phase 1 Milestone

By the end of this phase you should be able to:

  • Write a tool-using agent loop in under 100 lines without a framework
  • Explain what stop_reason values mean and how parallel tool calls work
  • Build the same agent on the Claude Agent SDK with a Skill, a hook, and a sub-agent
  • Articulate, in 200 words, what the harness gave you for free that you wrote yourself in the from-scratch version

Phase 2: Build a real agent with proper architecture (3–4 weeks)

Your goal this phase: Build a multi-step, persistent, stateful agent on LangGraph 1.0 + LangChain create_agent + Deep Agents

This is the stack you'll likely run in production. The conceptual model (state machine of nodes and edges, middleware, checkpointer) generalizes everywhere

Why this stack and not Pydantic AI, OpenAI Agents SDK, or CrewAI:

  • LangGraph is the only framework in the Alice Labs and Channel.tel "what ships" rankings that combines durable execution, checkpointing, human-in-the-loop, first-class observability via LangSmith, and middleware

create_agent (LangChain 1.0, Oct 2025) is now the default agent factory built on the LangGraph runtime. The older create_react_agent is deprecated

Deep Agents (LangChain, launched Aug 2025; v0.5 alpha April 2026) is a batteries-included harness on top: planning, virtual filesystem, sub-agents, summarization, skills. It's the closest open-source analog to Claude Code's harness, but model-agnostic

What to learn

1. The LangGraph runtime

A state graph of nodes and edges, with a checkpointer that lets you resume, rewind, and fork

Resources:

  • LangGraph docs (free, official) — The runtime reference. Start with the concepts page, then the quickstart
  • Doubling down on Deep Agents (LangChain) (free) — Defines harness vs framework vs runtime cleanly
  • Context Management for Deep Agents (LangChain) (free) — The 20K-token tool-response offload pattern and the 85% context-window compression triggers
  • On Agent Frameworks and Agent Observability (LangChain) (free) — Why LangSmith is OTEL-friendly and works without LangChain. Useful even if you choose another platform later
  • Deep Agents v0.5 (LangChain) (free) — The April 2026 release notes. Async (non-blocking) sub-agents, expanded multi-modal filesystem support, async TODOs. Read this before you pin a deepagents version in your project

*What to focus on:* state schemas, nodes, edges, conditional edges, the PostgresSaver checkpointer, time-travel debugging, human-in-the-loop interrupts, and how middleware composes
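The 20K-token tool-response offload pattern from the Context Management post can be sketched in a few lines. The chars/4 token estimate, the workspace layout, and the 10-line preview below are assumptions for illustration:

```python
# Hypothetical offload sketch: large tool results leave context, a path stays.

from pathlib import Path

OFFLOAD_TOKENS = 20_000  # assumed threshold from the offload pattern

def offload_if_large(result: str, workspace: Path, call_id: str) -> str:
    if len(result) // 4 <= OFFLOAD_TOKENS:   # crude chars/4 token estimate
        return result
    path = workspace / f"{call_id}.txt"
    path.write_text(result)                  # full result goes to disk
    preview = "\n".join(result.splitlines()[:10])
    # Context gets a pointer plus a short preview instead of the full payload
    return f"[result offloaded to {path}]\n{preview}"
```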

2. Middleware as the customization layer

Middleware is how you customize a packaged agent without forking it

Resources:

  • How Middleware Lets You Customize Your Agent Harness (LangChain) (Mar 26, 2026) (free) — The before_agent, wrap_model_call, before_tools, after_tools hooks. Required reading
  • Introducing ambient agents (LangChain) (free) — Background-agent UX patterns: notify, question, review

*What to focus on:* where each hook fires in the agent lifecycle, how SummarizationMiddleware and FilesystemMiddleware compose, how to write a custom middleware in 30 lines, and when middleware is the right answer vs writing a new node

3. Tools, MCP, and the code-execution pattern

The naive "load all MCP tools into context" pattern is broken. The correct pattern is code execution with MCP

Resources:

  • Code execution with MCP (Anthropic) (Nov 2025) (free, official) — The 150K → 2K token reduction. Read this before you wire any MCP server
  • Introducing advanced tool use (Anthropic) (free, official) — defer_loading: true cut tool tokens 85% and lifted Opus 4.5 MCP eval from 79.5% to 88.1%
  • Scaling Managed Agents (Anthropic) (free, official) — The session, harness, and sandbox separation. Read it even if you don't use Managed Agents
  • Composio docs (free tier) — 200+ SaaS integrations, MCP gateway built in, brokers credentials so they never enter the model context
  • Arcade docs (free tier) — Use when you need fine-grained per-user identity rather than service-level auth

*What to focus on:* defer_loading, code execution as a tool surface, why round-tripping JSON through the model is expensive, and how Composio or Arcade brokers SaaS auth without leaking credentials into the model context
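To see why deferring pays off, here is an illustration of the idea (not Anthropic's actual defer_loading API): ship name-plus-description stubs up front, and load a full JSON schema only when the model decides to use that tool. The tool definitions are invented:

```python
# Illustration of deferred tool loading: stubs in context, full schemas on demand.

FULL_TOOLS = {
    "send_email": {"description": "Send an email",
                   "input_schema": {"type": "object", "properties": {
                       "to": {"type": "string"}, "subject": {"type": "string"},
                       "body": {"type": "string"}}, "required": ["to", "body"]}},
    "search_crm": {"description": "Search CRM contacts",
                   "input_schema": {"type": "object", "properties": {
                       "query": {"type": "string"}}, "required": ["query"]}},
}

def tool_stubs() -> list[dict]:
    """What goes in context up front: names and one-line descriptions only."""
    return [{"name": n, "description": t["description"]} for n, t in FULL_TOOLS.items()]

def load_tool(name: str) -> dict:
    """Loaded on demand, only when the model actually reaches for this tool."""
    return {"name": name, **FULL_TOOLS[name]}
```

With two tools the savings look trivial; with 200 MCP tools the schema payload is where the 85% token cut comes from.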

4. Memory choices that aren't a vector DB

Resources:

  • Letta MemFS benchmark on LoCoMo (free) — The April 2026 result: filesystem-based memory on GPT-4o-mini hit 74% on LoCoMo, beating bespoke memory tools
  • Mem0 docs (free) — User-scoped knowledge memory. Pick this for cross-session user facts

*What to focus on:* the three memory layers (thread-scoped via PostgresSaver, user-scoped via Mem0/Zep, self-managed via Letta), why filesystem is the right default, and not reaching for a vector DB until you've measured an actual recall problem

Practice project: Build a "research analyst" deep agent

Input: a research question

The lead agent plans, writes a TODO list to a virtual filesystem, and spawns 3 search sub-agents in parallel, each with isolated context

Sub-agents call Tavily or Firecrawl, write results to files, and return short summaries to the parent. Never raw search results into the parent's context

A citation sub-agent verifies claims against retrieved sources

A writer agent produces a final Markdown report with inline citations

All state persists via PostgresSaver. Kill the process mid-run, resume from where it left off

Human-in-the-loop interrupt: agent must ask for confirmation before exceeding $1 in tokens

Wrap the whole thing in a single make demo target that runs the full pipeline end to end

README must articulate: which middleware you used and why, which sub-agents have isolated context, what your context-compression strategy is, and what your durability story is on process kill

Ship a LangSmith trace URL for one full run alongside the README

Phase 2 Milestone

By the end of this phase you should be able to:

  • Build a multi-step LangGraph agent with PostgresSaver durability and human-in-the-loop interrupts
  • Use Deep Agents middleware (planning, filesystem, sub-agents, summarization) as a packaged harness
  • Spawn isolated-context sub-agents and return compressed summaries to the parent
  • Articulate your context-compression strategy and your durability story on process kill
  • Produce a LangSmith trace URL showing the full multi-step trajectory

Phase 3: Build the harness layer yourself (3–4 weeks)

Your goal this phase: Stop using a packaged harness and build a thin one. You'll never make the right harness trade-offs in production until you've built one once

This is the highest-leverage phase in the roadmap

What to learn

1. What "harness" decomposes into

Synthesizing the Deep Agents middleware list, the Claude Agent SDK architecture, and Vivek Trivedy's harness-engineering write-up, the harness is the union of:

  • loop control. The while-loop driving model→tools→model
  • tool dispatch. Registry, schema validation, parallel calls, error recovery, retries
  • context management. System-prompt assembly, message-history compaction at 85–95% of window, tool-response offloading at ~20K tokens, prompt caching
  • persistence. Checkpoint state every node so you can resume, rewind, fork
  • sub-agent orchestration. Spawn isolated-context children, route compressed summaries back
  • skills and progressive disclosure. Load capabilities only when relevant
  • hooks. PreToolUse, PostToolUse, PreCompact, Stop, SessionStart (the Claude Code list is canonical)
  • observability. OTEL spans for every model call, tool call, sub-agent invocation, with token counts and latency
  • sandboxing. Code execution and MCP tool calls happen in a container the model never has direct creds to
  • auth and secrets brokering. Credentials never enter the model's context (Anthropic Managed Agents pattern)
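The hooks component can be sketched generically. Hook names are borrowed from the Claude Code list; the blocking semantics (a pre-hook returning False vetoes the call) are an assumption for this sketch:

```python
# Hypothetical pluggable hook system around tool dispatch.

hooks = {"pre_tool": [], "post_tool": []}

def on(event):
    def register(fn):
        hooks[event].append(fn)
        return fn
    return register

def dispatch_tool(name: str, args: dict, tools: dict):
    for hook in hooks["pre_tool"]:
        if hook(name, args) is False:        # any pre-hook can veto the call
            return f"[blocked: {name}]"
    result = tools[name](**args)
    for hook in hooks["post_tool"]:
        result = hook(name, result)          # post-hooks can transform the result
    return result

@on("pre_tool")
def block_rm(name, args):
    # Veto destructive shell commands before they ever run
    return not (name == "bash" and "rm -rf" in args.get("cmd", ""))
```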

Resources:

  • The Anatomy of an Agent Harness (LangChain) (free) — The cleanest decomposition of harness components in the public literature. Reference text for the entire phase. Read this before you write a single line of harness code
  • Improving Deep Agents with harness engineering by Vivek Trivedy (LangChain) (Feb 17, 2026) (free) — Went from rank 30 to rank 5 on Terminal-Bench 2.0 only by changing the harness, holding the model fixed at GPT-5.2-codex. The recipe is in the post
  • Better Harness: A Recipe for Harness Hill-Climbing with Evals by Vivek Trivedy (LangChain) (Apr 29, 2026) (free) — The direct sequel. Self-verification and tracing as the recipe for autonomously improving a harness. Read this immediately after the Feb 17 post
  • Inside the Claude Agents SDK (ML6) (free) — The CPU/RAM/OS/App analogy and the 78% vs 42% harness-comparison number
  • everything-claude-code (Cerebral Valley × Anthropic hackathon winner) (free, GitHub) — For inspiration on where to stop adding features
  • deepagents source (free, GitHub) — Read this alongside your own harness as a reference. The middleware files are the core of the harness pattern

*What to focus on:* which harness components are worth writing yourself, which to import, and the order in which features pay off (loop and tool dispatch before sub-agents before durability before observability)

2. Durable execution as an add-on

Resources:

  • Inngest docs (free) — Durable steps and checkpointing went GA in Dec 2025. The easiest path to durability for a Python harness
  • Temporal Python SDK (free) — The OpenAI Agents SDK and Temporal integration shipped in March 2026. Treat each tool call as a durable step

*What to focus on:* idempotency keys per step, retry policies, what happens to in-flight tool calls on process kill, and where your harness's checkpoint boundary should be (per node, not per token)
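The idempotency-key idea can be sketched as a checkpoint-per-step executor. The in-memory dict below stands in for SQLite or a durable backend like Inngest or Temporal:

```python
# Sketch of durable execution: each step keyed by (run_id, step), never re-run.

completed: dict[tuple[str, int], str] = {}  # stand-in for a durable store

def durable_step(run_id: str, step: int, fn, *args) -> str:
    key = (run_id, step)
    if key in completed:          # already ran before the crash: replay the result
        return completed[key]
    result = fn(*args)
    completed[key] = result       # checkpoint BEFORE moving to the next step
    return result
```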

Practice project: Write a mini-harness in ~1,500 lines of Python

A loop wrapping anthropic.messages.create or LiteLLM for model-agnosticism

Tool registry from a Python decorator (@tool) with JSON-schema generation

A CLAUDE.md-style system-prompt loader that reads ./harness/rules/*.md with path-glob matching

A SKILL.md progressive-disclosure loader (aim for under 50 tokens of metadata per skill in context)

A sub-agent spawn primitive with isolated context, returning a summary string back to parent

Filesystem offload: any tool result over 20K tokens is written to ./workspace/<id>.txt and replaced in context with a path plus 10-line preview

Auto-compaction at 85% of context window: summarize messages older than the last 10 turns

A pluggable hook system (pre_tool, post_tool, stop)

OpenTelemetry tracing via opentelemetry-sdk exported to LangSmith or Phoenix (both speak OTEL)

Durable resume: persist message history and state to SQLite after each step, reload by run ID

Optional add-on: wrap the whole thing in Inngest or Temporal so each tool call becomes a durable step
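As a starting point for the tool registry, here is a sketch of the @tool decorator with JSON-schema generation from type hints. It maps only primitive types; a real harness would also parse docstrings for parameter descriptions and handle optionals:

```python
# Hypothetical @tool registry: derive a minimal JSON schema from type hints.

import inspect

TOOL_REGISTRY: dict[str, dict] = {}
_JSON_TYPES = {int: "integer", str: "string", float: "number", bool: "boolean"}

def tool(fn):
    sig = inspect.signature(fn)
    TOOL_REGISTRY[fn.__name__] = {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "input_schema": {
            "type": "object",
            "properties": {name: {"type": _JSON_TYPES[p.annotation]}
                           for name, p in sig.parameters.items()},
            "required": list(sig.parameters),
        },
        "fn": fn,
    }
    return fn

@tool
def read_file(path: str) -> str:
    """Read a UTF-8 text file and return its contents."""
    return open(path, encoding="utf-8").read()
```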

Phase 3 Milestone

By the end of this phase you should be able to:

  • List the ten components of a modern harness and explain when each pays off
  • Write a 1,500-line Python harness with loop, tool dispatch, context compression, sub-agents, hooks, and OTEL traces
  • Wire durable execution via Inngest or Temporal so a process kill is recoverable
  • Produce a 1,000-word post-mortem comparing your mini-harness to the Claude Agent SDK and Deep Agents: what you got right, what you cut, what you'd do differently. That post-mortem is the real deliverable; the code is just evidence

Phase 4: Build the eval and regression harness (3–4 weeks)

Your goal this phase: Make your agent measurable. Without this, every "improvement" is vibes

This is where most engineers stall. They can build a great agent and can't tell whether their next change made it better or worse

What to learn

1. Pick exactly one observability platform

Don't run two. The five real options:

  • LangSmith. Pick if you live in LangGraph or LangChain. Native tracing. March 2026 added Sandboxes, the Polly debugging assistant, Skills, and Fleet (agent identity/sharing)
  • Braintrust. Pick if you want framework-agnostic CI quality gates that block PRs. $80M Series B Feb 2026. Flat $249/mo for unlimited users vs LangSmith's $39/seat
  • Arize Phoenix (open source) and Arize AX (managed). Pick if you want OpenTelemetry-native tracing, drift detection, and a clean migration path from OSS to managed
  • W&B Weave. Pick if you're already on Weights & Biases for ML. Now has full agent trace views, MCP auto-logging, and forthcoming A2A tracing
  • Inspect (UK AISI). Pick for benchmark-grade evals. GAIA, SWE-bench, Cybench, BFCL all ship as inspect_evals packages. Used by Anthropic, DeepMind, and Grok internally

Resources:

  • LangSmith docs (free tier, official) — Production tracing, online evals, experiments, and the new Polly debugging assistant
  • Inspect AI annotated notes by Hamel Husain (free) — Hamel's notes are the practitioner write-up I lean on. Read this before installing Inspect
  • Inspect docs (free, official) — The framework reference
  • inspect_evals (free, GitHub) — 200+ standard evals as a Python package. GAIA, SWE-bench, Cybench, BFCL
  • Braintrust docs (free tier) — Framework-agnostic experiments, CI gates, and golden datasets
  • Agent Evaluation Readiness Checklist (LangChain) (free) — 17-minute practical checklist: error analysis, dataset construction, grader design, offline and online evals, production readiness. Print this and tape it to your monitor for the entire phase
  • Quantifying infrastructure noise in agentic coding evals (Anthropic) (Feb 05, 2026) (free, official) — Flaky sandboxes and network jitter alone can swing eval scores by several points. Before you trust any agent benchmark number (yours or someone else's), read this

*What to focus on:* trace sampling strategy, online vs offline evals, the difference between a metric and a guardrail, and why CI gating is the pattern that turns evals from dashboard wallpaper into a development tool

2. The four eval types you must implement

Per Anthropic's "Demystifying evals for AI agents":

  • Single-turn evals: given this input, is the output right? Cheapest, deterministic graders where possible, run constantly
  • Trajectory evals: did the agent call the right sequence of tools with the right arguments? Test single-step, full-turn, and multi-turn variants
  • LLM-as-judge: for open-ended outputs (research reports, code review). Calibrate against human-graded examples weekly. Anthropic's research-agent rubric used 0.0–1.0 across factual accuracy, citation quality, completeness, source quality, tool efficiency
  • End-state evals: for stateful agents (DB writes, file edits). Compare the final state of the environment to ground truth. This is τ-bench's approach
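Trajectory evals are the type people skip, and they're often just a deterministic check over recorded tool calls. A sketch, with a deliberately simplified hypothetical trace shape (a list of `{"tool": name}` steps) and criteria borrowed from the research-agent project later in this phase:

```python
def grade_trajectory(trace: list[dict]) -> dict:
    """Deterministic trajectory grader: right tools, right order, within budget."""
    tools = [step["tool"] for step in trace]
    return {
        "planned_first": bool(tools) and tools[0] == "plan",
        "spawned_subagents": tools.count("spawn_subagent") >= 2,
        "cited_sources": "cite" in tools,
        "under_budget": len(trace) <= 25,
    }

trace = [{"tool": "plan"}, {"tool": "spawn_subagent"},
         {"tool": "spawn_subagent"}, {"tool": "cite"}]
scores = grade_trajectory(trace)  # all four checks pass for this trace
```

Because this grader is deterministic, it can run on every PR for free, unlike an LLM judge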

Resources:

  • Demystifying evals for AI agents (Anthropic) (free, official) — Anthropic's best primer on the topic
  • Evaluating Deep Agents: Our Learnings (LangChain) (free) — Single-step, full-turn, and multi-turn trajectory eval patterns. The practitioner guide
  • How we build evals for Deep Agents (LangChain) (free) — Companion piece. How they actually source data, design metrics, and run well-scoped evals. Pair with the post above
  • Eval awareness in Claude Opus 4.6's BrowseComp performance (Anthropic) (Mar 06, 2026) (free, official) — Models can detect when they're being evaluated and behave differently. Read this before designing your eval suite or you'll bake the bias in
  • Designing AI-resistant technical evaluations (Anthropic) (Jan 21, 2026) (free, official) — Companion concern: how to design evals that don't get gamed by the very models you're scoring. Required reading if you're rolling your own benchmark
  • τ²-bench repository (free, GitHub) — Multi-turn customer-service evals with policy compliance
  • Establishing Best Practices for Building Rigorous Agentic Benchmarks (arXiv) (free) — Read this before designing anything original. SWE-bench, KernelBench, and WebArena all overestimate by 5–33%

*What to focus on:* how to write a deterministic grader where you can, how to calibrate an LLM judge against human grades, when pass^k matters more than pass@1, and how to detect and discard contaminated benchmarks

Practice project: Build a regression harness around your Phase 2 research agent

Build a golden dataset of 30–50 hand-graded research questions across three difficulty levels (Level 1/2/3, GAIA-style)

Implement deterministic graders where possible (exact-match on factual queries) and an LLM-as-judge scorer with a 5-criterion rubric for open-ended ones

Build a trajectory eval: did the agent plan, spawn ≥2 sub-agents, cite sources, finish under budget?

Wire it into GitHub Actions: every PR runs the full suite. Block merge if golden-set pass rate drops by ≥3 points or any pass^4 metric drops

Add production sampling: 1% of live traces get auto-graded by LLM-as-judge nightly. Alert on drift

Re-run the agent against at least one published benchmark via Inspect: GAIA Level 1 or τ²-bench retail. Compare your numbers to public leaderboards

Ship a make eval target that emits three artifacts: a CI pass/fail summary, a LangSmith experiment URL, and an Inspect log file with one canonical benchmark score
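The CI gate described above can be a tiny script that compares the current run to a stored baseline and fails the build on regression. A sketch using the 3-point threshold from the rule above; the artifact filenames and `pass_rate` JSON key are hypothetical:

```python
import json

def gate(baseline_path: str, current_path: str, max_drop: float = 3.0) -> bool:
    """Fail CI if the golden-set pass rate dropped by >= max_drop points."""
    with open(baseline_path) as f:
        baseline = json.load(f)["pass_rate"]
    with open(current_path) as f:
        current = json.load(f)["pass_rate"]
    drop = baseline - current
    print(f"baseline={baseline:.1f} current={current:.1f} drop={drop:+.1f}")
    return drop < max_drop

# In GitHub Actions, run this from the `make eval` target and
# exit nonzero when gate() returns False to block the merge
```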

Phase 4 Milestone

By the end of this phase you should be able to:

  • Pick one observability platform and defend the choice on architectural grounds
  • Implement all four eval types. Single-turn, trajectory, LLM-as-judge, end-state
  • Maintain a golden dataset that grows from production failures, not synthetic data
  • Block PRs in CI when eval scores regress
  • Produce a make eval target that emits a CI pass/fail summary, a LangSmith experiment URL, and an Inspect log file with one canonical benchmark score
  • Document the failure modes you found in your own agent. That document is the actual product

Phase 5: Production hardening (ongoing)

Your goal this phase: Take everything you've built and make it survive contact with real users, real cost, and real failures

This is permanent, not a phase you finish

What to learn

1. Cost discipline

Use prompt caching aggressively. Anthropic's caching saves up to 90% on repeated prefixes. Cache your CLAUDE.md, system prompt, and tool definitions
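Caching boundaries are easy to get wrong, so it helps to see the request shape. A sketch assuming the Anthropic Messages API's cache_control blocks; the model ID is a placeholder, and you should check the current docs for cache-breakpoint limits and minimum prefix sizes:

```python
def build_request(system_prompt: str, claude_md: str,
                  tools: list, user_msg: str) -> dict:
    """Mark stable prefixes (system prompt, CLAUDE.md, tool defs) as cacheable."""
    cacheable = {"cache_control": {"type": "ephemeral"}}
    return {
        "model": "claude-sonnet-4-5",  # placeholder; route per your policy
        "max_tokens": 1024,
        # Stable content goes first so every request shares the cached prefix
        "system": [
            {"type": "text", "text": system_prompt, **cacheable},
            {"type": "text", "text": claude_md, **cacheable},
        ],
        "tools": tools,  # tool definitions precede messages in the prefix
        "messages": [{"role": "user", "content": user_msg}],
    }
```

The design rule: anything that changes per turn (user message, tool results) stays after the cache boundary; anything stable stays before it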

Route by difficulty: Haiku 4.5 or Sonnet 4.6 for simple turns, Opus 4.7 for planning and hard reasoning

The "advisor tool" beta (Anthropic, March 2026) lets you pair an executor with a higher-IQ advisor mid-generation

Watch the Opus 4.7 tokenizer: same sticker price as 4.6 but ~1.0–1.35× the billable tokens for the same text. Re-measure cost-per-task after migrations

Batch API for non-real-time workloads gets 50% off

For multi-agent (Anthropic-style research): expect ~15× the tokens of single-agent chat. Only run multi-agent when the answer's value clears that bar

Resources:

  • Open Models have crossed a threshold (LangChain) (free) — GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks (file ops, tool use, instruction following). Read this before locking your model selection and routing strategy

*What to focus on:* prompt caching boundaries, model routing rules, batch vs real-time decisions, and a hard cost-per-task budget you monitor

2. Latency

Parallel tool calls. Anthropic's research-system prompt literally says "you MUST use parallel tool calls when creating multiple sub-agents." Same applies to your own agents

Streaming partial outputs to UI via LangGraph's stream_mode="updates"

Sub-agent fan-out is the single biggest latency lever: a 60-step sequential agent becomes a 10-step lead plus 5 parallel 10-step sub-agents
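The fan-out arithmetic is easiest to see in plain Python: wall-clock time collapses to roughly the slowest sub-task instead of their sum. A sketch with a stubbed sub-agent call; in LangGraph you'd express this as parallel branches rather than a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    # Stand-in for a real sub-agent call that returns a summary string
    return f"summary of {task}"

def fan_out(lead_plan: list[str], max_workers: int = 5) -> list[str]:
    """Run independent sub-agent tasks in parallel; results keep plan order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subagent, lead_plan))

results = fan_out(["pricing research", "competitor scan", "sources check"])
```

Note this is only safe because the sub-tasks are independent; parallelizing coupled steps is where multi-agent coding goes wrong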

*What to focus on:* where parallelism is safe, where streaming changes the UX, and how fan-out interacts with cost

3. Safety and sandboxing

All code execution in a sandbox: Modal, E2B, Daytona, or LangSmith Sandboxes (private preview, March 2026). Never exec() model output in your main process

Credentials brokered outside the model context (Anthropic Managed Agents pattern; Composio handles this for SaaS auth)

Hooks for guardrails: PreToolUse hooks that block destructive Bash, regex-block secrets, validate file-write paths

Human-in-the-loop interrupts on any irreversible action (LangGraph's interrupt() plus HumanInTheLoopMiddleware, Claude Agent SDK's permission prompts)
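A PreToolUse-style guard can start as a handful of regexes. A sketch; the patterns and the ./workspace/ write restriction are illustrative, not an exhaustive blocklist:

```python
import re

DESTRUCTIVE_BASH = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bgit\s+push\s+--force\b"),
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
]
# Example shapes of an Anthropic API key and an AWS access key ID
SECRET_PATTERN = re.compile(r"(sk-[A-Za-z0-9]{20,}|AKIA[0-9A-Z]{16})")

def pre_tool_guard(tool: str, args: dict) -> tuple[bool, str]:
    """Return (allowed, reason). Block destructive bash, secrets, stray writes."""
    if tool == "bash":
        cmd = args.get("command", "")
        if any(p.search(cmd) for p in DESTRUCTIVE_BASH):
            return False, "destructive command blocked"
    if SECRET_PATTERN.search(str(args)):
        return False, "possible secret in tool arguments"
    if tool == "write_file" and not args.get("path", "").startswith("./workspace/"):
        return False, "writes restricted to ./workspace/"
    return True, "ok"
```

Anything this guard can't classify confidently should fall through to a human-in-the-loop interrupt rather than a silent allow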

Resources:

  • Modal docs (free tier) — The default sandbox for Python code execution
  • E2B docs (free tier) — Code-execution sandboxes designed for AI agents
  • Beyond permission prompts: making Claude Code more secure and autonomous (Anthropic) (Oct 20, 2025) (free, official) — The foundational sandboxing post. How Claude Code stops asking permission for safe actions and contains the unsafe ones. The pattern your harness should copy
  • Claude Code auto mode: a safer way to skip permissions (Anthropic) (Mar 25, 2026) (free, official) — The follow-up. What changes when you let the agent run unattended. Read both before you flip any "skip confirmation" flag in production

*What to focus on:* which actions are reversible, which require human approval, and how to make sure the model never sees the credential it uses

4. Monitoring and drift

100% trace sampling at low scale; downsample to 1–10% with stratified sampling on errors at high scale

Alerts on: token cost per request, tool-call failure rate, LLM-as-judge mean score (nightly), p95 latency, eval regression

Re-baseline evals after every model upgrade

Anthropic's own engineering blog warns: "harnesses encode assumptions about what Claude can't do on its own; those assumptions go stale as models improve" (the "context anxiety" example with Sonnet 4.5 → Opus 4.5)

*What to focus on:* what to alert on vs what to log, how to detect prompt-cache invalidation, and how to spot harness ossification when the model has moved past it

5. Resilience

Durable execution (Inngest, Temporal, or LangGraph PostgresSaver) is non-negotiable for any agent that runs over 60 seconds

Checkpoint after every node. Rewind and fork should be possible. Pydantic Deep Agents and LangGraph both support this. The Claude Agent SDK's session log is equivalent
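To make the resume path concrete, here's a minimal checkpoint-after-every-step store over SQLite. This is a sketch of the idea only; LangGraph's PostgresSaver and the Claude Agent SDK session log give you the production version with rewind and fork:

```python
import json
import sqlite3

class CheckpointStore:
    """Persist full agent state after every step; reload by run ID to resume."""
    def __init__(self, db_path: str = "checkpoints.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(run_id TEXT, step INTEGER, state TEXT, PRIMARY KEY (run_id, step))"
        )

    def save(self, run_id: str, step: int, state: dict):
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (run_id, step, json.dumps(state)),
        )
        self.conn.commit()

    def resume(self, run_id: str):
        """Return (latest_step, state) for a run, or None if never started."""
        row = self.conn.execute(
            "SELECT step, state FROM checkpoints WHERE run_id = ? "
            "ORDER BY step DESC LIMIT 1", (run_id,)
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else None
```

Test your resume path the same way: kill the process mid-run on purpose and verify the next invocation picks up at the last step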

Resources:

  • How My Agents Self-Heal in Production (LangChain) (free) — A working pipeline that detects regressions after every deploy, triages the cause, and opens a fix PR with no human in the loop until review. Steal the pattern

*What to focus on:* which kinds of failures you can recover from automatically, which need human escalation, and how to test your resume path before production traffic forces the issue

Phase 5 Milestone

This phase doesn't end. But you should have:

  • Prompt caching wired across system prompt, CLAUDE.md, and tool definitions
  • A model-routing layer with hard cost-per-task budgets and alerting
  • A sandbox for all code execution and a credential broker keeping secrets out of context
  • Hooks blocking destructive actions and forcing human approval on irreversible ones
  • Trace sampling, drift alerts, and a re-baselining ritual on every model upgrade
  • A durable-execution layer so a process kill is a non-event

Recommendations

*Decision-ready takes you can act on today*

If you only learn one framework: LangGraph 1.0 + Deep Agents

It's the most general option, its runtime story is the most mature today (PostgresSaver, time-travel debugging, durable execution, OTEL-friendly observability via LangSmith), it's model-agnostic, and its core abstraction (state graph plus middleware) is a generalizable mental model

Full stop

If you only learn one harness as a reference: Claude Agent SDK plus Claude Code

It is the reference example. CLAUDE.md, Skills, sub-agents, hooks, plan mode, the filesystem-as-memory pattern. Every other harness in 2026 is converging on these primitives

Use Claude Code daily, read its docs, study the open-source harness compendiums

If you only read one thing on context

Anthropic's "Effective context engineering for AI agents" (Sep 2025)

If you only read two: add LangChain's "Context Engineering for Agents" for the Write/Select/Compress/Isolate framework

If you only learn one observability tool

LangSmith if you're staying on LangGraph

Braintrust if you want framework-agnostic CI gating

Inspect if you want benchmark-grade rigor (and you should, eventually)

Skip in 2026

AutoGen v0.4 (merged into Microsoft Agent Framework; the community lineage is AG2. Neither is a strong default)

OpenAI Swarm (officially superseded, explicitly "not production-ready" per OpenAI's own README)

The Assistants API (sunsetting mid-2026)

Building your own vector store or memory before you've measured an actual recall problem

"No-code" agent platforms unless you're building something throwaway

Use only when you have a specific reason

CrewAI. Fastest idea-to-prototype, fragile in production. Use for hackathons and demos

OpenAI Agents SDK. Fine if you're OpenAI-locked. The April 2026 update added sandboxing and a harness layer, but you're still tied to OpenAI models

Pydantic AI / Pydantic Deep Agents. Pick if you're a strict-types FastAPI shop

Mastra. Pick only if your team is TypeScript and can't use Python. v1.0 Jan 2026, YC W25, 22k+ stars, built by the Gatsby team

Smolagents. Best teaching tool for code-agent patterns (its 1,000-line codebase is hackable). Production-weak

DSPy 3.0 + GEPA. When you have a metric and want to programmatically optimize prompts and agent topology. GEPA outperforms RL by 6% with 35× fewer rollouts (ICLR 2026 oral)

Letta / MemGPT. If you need OS-style agent self-managed memory across sessions. Otherwise filesystem plus Mem0 is simpler

Benchmarks to bookmark (May 2026 numbers)

SWE-bench Verified: Claude Opus 4.7 ≈ 87.6%, GPT-5.5 ≈ 88.7%, Gemini 3.1 Pro ≈ 78.8%

Terminal-Bench 2.0: GPT-5.5 82.7%, Opus 4.7 ~70%, Gemini 3.1 Pro ~68%

τ-bench: Claude Mythos Preview leads at 89.2%

BrowseComp: GPT-5.5 90.1%, Gemini 3.1 Pro 85.9%, Opus 4.7 79.3% (a regression from 4.6's 83.7%. Route web research to GPT-5.5)

GAIA / Princeton HAL: Sonnet 4.5 leads at 74.6%

Time-boxed milestones for a technically strong engineer new to agents

Week 2: Phase 0 done. You can explain a harness in plain English

Week 5: Phase 1 done. Claude Agent SDK agent shipped with one Skill, one hook, one sub-agent

Week 9: Phase 2 done. LangGraph deep-agent research analyst running with PostgresSaver durability and LangSmith traces

Week 13: Phase 3 done. 1,500-line mini-harness, written and documented, comparable in capabilities to a stripped Claude Agent SDK

Week 17: Phase 4 done. Golden datasets, CI gates, one published-benchmark run via Inspect

Forever: Phase 5

If you're moonlighting at 10–15 hours/week, multiply by ~2.5×

The benchmarks that change the plan: if you can't get Phase 1 working in 3 weeks, your tool design is wrong (re-read "Writing tools for agents"). If Phase 2 takes more than 5 weeks, you're trying to build the harness too. Drop down to Deep Agents and stop fighting it

Caveats

*Things that will trip you up if you don't see them coming*

Benchmarks are moving targets and partially gamed

SWE-bench Verified scores went from 1.96% to 80%+ in two years

τ-bench's pass^k consistency metric was added precisely because single-run accuracy stopped being informative

Treat any "X model scored Y%" claim as joint with the harness, the scaffold, the retry budget, and the system prompt. Not the model alone

Multi-agent is overhyped for most use cases

The 90.2% improvement Anthropic reported is for breadth-first research specifically

For coding and tightly coupled tasks, multi-agent often performs worse than single-agent and burns 15× the tokens

Default to single-agent plus sub-agents for scoped exploration. Reach for full multi-agent only when the task decomposes naturally

The counter-example to bookmark: Anthropic's "Building a C compiler with a team of parallel Claudes" (Feb 05, 2026) at https://anthropic.com/engineering/building-c-compiler shows a coding task where parallel sub-agents did pay off. Multi-agent isn't dead for code, it just needs the right decomposition

Speculation flags in 2026 sources

Several "AI 2027" projections (OpenBrain $45B revenue, etc.) are explicitly fictional but get cited as stats. Ignore them

Launch-week reception articles are anecdotal. Treat them as signal about developer sentiment, not as benchmarks

The framework landscape can shift again

LangChain's own framing has moved twice in 18 months (chains → graphs → harnesses-on-graphs)

Any of Pydantic AI, Mastra, or Deep Agents could be much bigger in 12 months

Bet on the abstractions (loop, tools, context, sub-agents, durability, traces) more than any one library. Those carry over

MCP's production rough edges are real

Streamable HTTP behind load balancers, multi-tenant auth, rate limiting, audit logging. All are explicitly on the 2026 MCP roadmap, meaning they're not solved yet

Plan for the next-gen transport SEPs landing in late 2026 and don't deeply couple to the current session model

Model-specific behavior changes between point releases

Opus 4.7's stricter instruction-following and new tokenizer mean your Opus 4.6 prompts may behave differently and cost up to 35% more in tokens for the same text

Re-replay traffic on every model bump

Your eval suite will rot

A golden dataset built today will saturate within months as models improve

Plan to grow it 10–20% per quarter from production failures, not from synthetic data

Keep human calibration on LLM-as-judge running indefinitely

Some sources in this roadmap are vendor-marketed

Lean on primary sources (Anthropic engineering blog, LangChain blog, OpenAI announcements, arXiv) where possible

The ranking-style "best of 2026" posts (Alice Labs, Channel.tel, GuruSup, Morph, Vstorm) are useful triangulation but each has commercial incentives

Where they agree with each other and with primary engineering sources, treat the consensus as reliable

Conclusion

What can you expect after working through this roadmap?

I'm going to be honest with you, without any sugarcoating

This roadmap will not make you a principal AI engineer in 17 weeks

But it will make you someone who can build and ship agent systems that survive production traffic

That happens to be the thing companies are paying for right now

The demand for engineers who can ship production agents is not slowing down

57% of teams in the LangChain State of Agent Engineering report already have agents in production, and 89% of those have observability wired

Quality is the #1 barrier (32%), which means the entire field is bottlenecked on engineers who can build evals and harnesses, not on engineers who can call an LLM API

Anthropic's own number captures the real opportunity: same model, different harness, 78% vs 42% on CORE

That gap is your job

The harness-engineering shift is the largest mispricing in software hiring right now

Companies still post "prompt engineer" roles

What they need is engineers who can take a frontier model and turn it into a production system that is measurable and durable

Now here is what I want you to take away from all of this:

  • Pick one project from each phase and build it. Not read about it. Build it, break it, fix it, deploy it, then put a LangSmith trace and a benchmark score in your README. The engineers who get hired are the ones who can show a trace, not the ones who can recite a framework comparison table
  • Start sharing what you learn. Write up your mini-harness post-mortem. Publish your golden-dataset findings. Post your benchmark numbers with the harness configuration that produced them. Teaching is the fastest way to learn and it builds your reputation at the same time. The best opportunities come from engineers who are visible, not from engineers who applied to 500 listings
  • And please don't wait until you feel ready. You will never feel ready. The gap between "I'm reading the LangChain blog" and "I'm shipping a deep agent with PostgresSaver durability" is where most engineers get stuck forever
  • Start applying, start building in public, start shipping the moment you have a working agent. Even if it's small. The market doesn't reward perfection. It rewards engineers who can make the model do something real and prove it didn't regress

17 weeks is enough to change everything if you put in the work

And I believe each of you reading this can do it

Just keep building and keep measuring what you build

*This article was written by the author from internal notes compiled over 2 to 3 months, and edited by Minimax 2.7

I have a content pipeline in Obsidian that powers these posts, writing in my style from my handwritten and hand-typed notes*