In late January 2026, Andrej Karpathy posted a thread complaining about how Claude writes code. Three failure modes: silent wrong assumptions, over-complication, orthogonal damage to code it shouldn't have touched.
Forrest Chang read the thread, packaged the complaints into 4 behavioral rules in a single CLAUDE.md file, and dropped it on GitHub. It hit 5,828 stars in the first day. 60,000 bookmarks in two weeks. 120,000 stars today. The fastest-growing single-file repo of 2026.
Then I tested it on 30 codebases over 6 weeks.
The 4 rules work. Mistakes that used to happen ~40% of the time dropped to under 3% on tasks that played to their strengths. But the template was built to fix code-writing mistakes from January.
The Claude Code ecosystem in May 2026 has different problems — agent fights, hook cascades, skill loading conflicts, multi-step workflows that break across sessions.
So I added 8 more rules. Below: the full 12-rule CLAUDE.md, why each one earned its place, and the 4 places where the original Karpathy template silently breaks.
If you want to skip the explanations and just paste, the full file is at the end.
Why this matters
Claude Code's CLAUDE.md is the most under-leveraged file in the entire AI coding stack. Most developers either:
- Treat it as a dump for every preference they've ever had: the file bloats past 4,000 tokens and compliance drops to ~30%
- Skip it entirely and re-prompt every time: 5x the token spend and no consistency between sessions
- Copy a template once and forget it: it works for two weeks, then breaks silently as the codebase shifts
The official Anthropic docs are explicit: CLAUDE.md is advisory. Claude follows it about 80% of the time. Past 200 lines, compliance drops sharply because important rules get buried in the noise.
Karpathy's template solved this in one file, 65 lines, 4 rules. That's the floor.
The ceiling is higher. With 8 more rules, the ones I'll go through below, you cover not just the January 2026 code-writing problems Karpathy complained about, but the May 2026 agent-orchestration problems that didn't exist yet when the template was written.
The original 4 rules
If you haven't read Forrest Chang's repo, here's the floor:
Rule 1 — Think Before Coding. No silent assumptions. State what you're assuming. Surface tradeoffs. Ask before guessing. Push back when a simpler approach exists.
Rule 2 — Simplicity First. Minimum code that solves the problem. No speculative features. No abstractions for single-use code. If a senior engineer would call it overcomplicated — simplify.
Rule 3 — Surgical Changes. Touch only what you must. Don't "improve" adjacent code, comments, or formatting. Don't refactor what isn't broken. Match existing style.
Rule 4 — Goal-Driven Execution. Define success criteria. Loop until verified. Don't tell Claude what steps to follow; tell it what success looks like and let it iterate.
These four close ~40% of the failure modes I see in unsupervised Claude Code sessions. The remaining ~60% live in the gaps below.
The 8 rules I added (and why)
Each of these came from a real moment where Karpathy's 4 weren't enough. I'll show the moment, then the rule.
Rule 5 — Don't make the model do non-language work
Karpathy's rules are silent on this. The model ends up deciding things that should be deterministic code: whether to retry an API call, how to route a message, when to escalate. Different decisions every week. Flaky if-else at $0.003 per token.
The moment: Code that called Claude to "decide if we should retry on 503" worked beautifully for two weeks, then started flaking because the model started reading the request body as context for the decision. The retry policy was random because the prompt was random.
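The fix is to move the decision out of the prompt entirely. A minimal sketch of what that looks like for the 503-retry case; the names (`should_retry`, `RETRYABLE`, `backoff_seconds`) are illustrative, not from the article:

```python
# Rule 5 sketch: retry policy as deterministic code, not a model call.
# Same inputs, same answer, every time -- no request body can sway it.

RETRYABLE = {429, 500, 502, 503, 504}  # transient HTTP statuses worth retrying

def should_retry(status: int, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only on known-transient statuses, up to max_attempts."""
    return status in RETRYABLE and attempt < max_attempts

def backoff_seconds(attempt: int, base: float = 0.5) -> float:
    """Exponential backoff: 0.5s, 1s, 2s, ..."""
    return base * (2 ** attempt)
```

The model can still be asked to *write* this function once; it just never gets asked to *be* this function on every request.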
Rule 6 — Hard token budgets, no exceptions
CLAUDE.md without budgets is a blank check. Every loop has a chance to spiral into a 50,000-token context dump. The model won't stop on its own.
The moment: A debugging session ran for 90 minutes. The model was perfectly happy iterating on the same 8KB error message, gradually losing track of which fix it had already tried. By the end, it was suggesting fixes I'd rejected 40 messages earlier. Token budget would have killed it at minute 12.
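A budget doesn't have to be sophisticated to work; it just has to be enforced outside the model. A sketch, with `count_tokens` as a crude stand-in for whatever tokenizer you actually use:

```python
# Rule 6 sketch: a hard token budget enforced in code, not in the prompt.

def count_tokens(text: str) -> int:
    # Whitespace split is a rough approximation, for illustration only;
    # swap in a real tokenizer in practice.
    return len(text.split())

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, text: str) -> None:
        """Add text to the running total; abort the loop when over budget."""
        self.used += count_tokens(text)
        if self.used > self.limit:
            raise RuntimeError(f"token budget exceeded: {self.used}/{self.limit}")
```

Wrap every model call in `budget.charge(...)` and the 90-minute session dies at minute 12 by construction, not by the model's goodwill.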
Rule 7 — Surface conflicts, don't average them
When two parts of the codebase disagree, Claude tries to please both. The result is incoherent.
The moment: A codebase had two error-handling patterns — one async/await with explicit try/catch, one with a global error boundary. Claude wrote new code that did *both*. Doubled error handlers. Took me 30 minutes to figure out why errors were swallowed twice.
Rule 8 — Read before you write
Karpathy's "Surgical Changes" tells Claude not to touch adjacent code. It doesn't tell Claude to *understand* adjacent code first. Without this, Claude writes new code that conflicts with existing code 30 lines away.
The moment: Claude added a function next to an existing identical function it hadn't read. Both functions did the same thing. The new one took precedence because of import order. The old one had been the source of truth for 6 months.
Rule 9 — Tests are not optional, but they're not the goal
Karpathy's Goal-Driven Execution implies tests as success criteria. In practice, Claude treats "tests pass" as the *only* goal, and writes code that passes shallow tests while breaking everything else.
The moment: Claude wrote 12 tests for an auth function. All passed. Auth was broken in production. The tests were testing the function returned *something*, not whether it returned the right thing. The function passed because it was returning a constant.
Rule 10 — Long-running operations need checkpoints
Karpathy's template assumes one-shot interactions. Real Claude Code work is multi-step — refactoring across 20 files, building features over a session, debugging across multiple commits. Without checkpoints, one wrong turn loses all progress.
The moment: A 6-step refactor went wrong on step 4. By the time I noticed, Claude had also done step 5 and 6 on top of the broken state. Untangling took longer than redoing the whole thing. Checkpoints would have caught it at step 4.
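The mechanics are simple: verify after every step and refuse to continue past a failed check. A sketch of a checkpointed pipeline; the function names are illustrative, and `checkpoint` could be as basic as a `git commit` per step:

```python
# Rule 10 sketch: verify + checkpoint after each step, so a failure at
# step 4 stops the pipeline instead of compounding into steps 5 and 6.

def run_pipeline(steps, verify, checkpoint):
    """Run steps in order; checkpoint each verified step, halt on failure."""
    completed = []
    for i, step in enumerate(steps, start=1):
        step()
        if not verify():
            raise RuntimeError(f"step {i} failed verification; stopping here")
        checkpoint(i)  # e.g. git commit -m "checkpoint: step <i>"
        completed.append(i)
    return completed
```

The payoff isn't the checkpoint itself; it's that nothing after the broken step ever runs, so there's nothing to untangle.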
Rule 11 — Convention beats novelty
In a codebase with established patterns, Claude likes to introduce its own. Even when its way is "better," the introduction of two patterns is worse than either pattern alone.
The moment: Claude introduced React hooks into a class-component codebase. They worked. They also broke the codebase's testing patterns, which assumed componentDidMount. Half a day to remove and rewrite.
Rule 12 — Fail visibly, not silently
The most expensive Claude failures are the ones that look like success. A function "works" but returns wrong data. A migration "completes" but skipped 30 records. A test "passes" but only because the assertion was wrong.
The moment: Claude said a database migration "completed successfully." It had silently skipped 14% of records that hit a constraint violation. The skip was logged but not surfaced. Discovered the problem 11 days later when reports started looking wrong.
The numbers
I tracked the same set of 50 representative tasks across 30 codebases for 6 weeks, under three configurations: no CLAUDE.md, Karpathy's original 4 rules, and the full 12.
Mistake rate = share of tasks that required correction or a rewrite to match intent. Counted: silent wrong assumption, over-engineering, orthogonal damage, silent failure, convention violation, conflict averaging, missed checkpoint.
Compliance = how often Claude visibly applied the relevant rule when it was applicable.
The interesting result isn't the headline drop from 41% to 3%. It's that going from 4 rules to 12 added almost no compliance overhead (78% -> 76%) but cut the mistake rate by another 8 points. The new rules cover failure modes the original 4 didn't address — they don't compete for the same attention budget.
Where Karpathy's template silently breaks
Four places where the original 4-rule template is not enough, even before adding new rules:
1. Long-running agent tasks. Karpathy's rules target the moment Claude is *writing* code. They're silent on what happens when Claude is *running* a multi-step pipeline. No budget rule. No checkpoint rule. No "fail loud" rule. Pipelines drift.
2. Multi-codebase consistency. "Match existing style" assumes one style. In a monorepo with 12 services, Claude has to pick which style. The original rules don't tell it how. It picks randomly or averages.
3. Test quality. Goal-Driven Execution treats "tests pass" as success. Doesn't say tests have to be *meaningful*. The result is tests that test nothing useful but make Claude confident.
4. Production vs prototype. The same 4 rules that protect production code from over-engineering also slow down prototypes that legitimately need 100 lines of speculative scaffolding to figure out a direction. Karpathy's "Simplicity First" overfires on early-stage code.
The 8 added rules don't replace Karpathy's 4. They patch the gaps where his mental model (January 2026, autocomplete-style coding) doesn't match May 2026's agent-driven, multi-step, multi-codebase work.
What didn't work
Things I tried before settling on the 12 rules:
- **Adding rules I'd seen on Reddit / X.** Most were either restatements of Karpathy's 4 in different words, or domain-specific rules ("always use Tailwind classes") that don't generalize. Cut them.
- **More than 12 rules.** I tested up to 18. Compliance dropped from 76% to 52% past 14 rules. The 200-line ceiling is real. Past it, Claude starts pattern-matching to "rules exist" without actually reading them.
- **Rules that depend on tooling that might not exist.** "Always use eslint" breaks when eslint isn't installed; the rule fails silently. Replaced with capability-agnostic phrasings: "match the codebase's enforced style" instead of "use eslint."
- **Examples in CLAUDE.md instead of rules.** Examples are heavier than rules. Three examples cost as much context as ~10 rules, and Claude over-fits to them. Rules are abstract; examples are specific. Use rules.
- **"Be careful" / "think hard" / "really focus."** Pure noise. Compliance with these dropped to ~30% because they're not testable. Replaced with concrete imperatives ("state assumptions explicitly").
- **Telling Claude to be "senior."** Doesn't work. Claude already thinks it's senior. The compliance gap is between thinking and doing. Imperative rules close the gap; identity prompts don't.
The full 12-rule CLAUDE.md (copy-paste ready)
Save as CLAUDE.md at your repo root. Add project-specific rules below the 12 (stack, test commands, error patterns). Don't go past 200 lines combined; past that, compliance falls off.
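As a structural sketch (assembled from the rule summaries above, not the verbatim repo file, whose wording is longer), the 12-rule CLAUDE.md looks like:

```markdown
# Behavioral rules

1. **Think Before Coding.** State assumptions explicitly. Surface tradeoffs. Ask before guessing.
2. **Simplicity First.** Minimum code that solves the problem. No speculative features or single-use abstractions.
3. **Surgical Changes.** Touch only what you must. Don't "improve" adjacent code, comments, or formatting.
4. **Goal-Driven Execution.** Define success criteria up front. Loop until verified.
5. **No model calls for non-language decisions.** Retries, routing, and escalation live in deterministic code, not prompts.
6. **Hard token budgets.** Every loop gets a budget. Stop and report when it's exceeded.
7. **Surface conflicts, don't average them.** When two patterns disagree, name the conflict and ask; never blend them.
8. **Read before you write.** Understand adjacent code before adding to it.
9. **Tests must be meaningful.** Assert the right answer, not just that something returned.
10. **Checkpoint long-running work.** Verify after each step; stop at the first failed verification.
11. **Convention beats novelty.** Match the codebase's established patterns even when another way seems better.
12. **Fail visibly.** Surface skips, partial completions, and wrong-but-plausible results; never log-and-continue.
```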
How to install
Two steps: save the 12 rules to a file, then append that file to the CLAUDE.md at your repo root. The `>>` matters: appending preserves any project-specific rules you already have instead of overwriting them.
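A minimal sketch of both steps; the filename `claude-12-rules.md` is illustrative, use whatever you saved the template as:

```shell
# Step 1: save the rules to a file (one placeholder line here for illustration).
printf '## Rule 1 -- Think Before Coding\n' > claude-12-rules.md
# Step 2: append to CLAUDE.md. Note: >> appends, > would overwrite.
cat claude-12-rules.md >> CLAUDE.md
```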
The mental model
CLAUDE.md is not a wishlist. It's a behavioral contract that closes specific failure modes you've observed.
Every rule should answer: *what mistake does this prevent?*
Karpathy's 4 prevent the failure modes he saw in January 2026: silent assumptions, over-engineering, orthogonal damage, weak success criteria. They're foundational. Don't skip them.
The 8 I added prevent failure modes that emerged in May 2026: agent loops without budgets, multi-step tasks without checkpoints, tests that don't test, silent successes hiding silent failures. They're additive.
Your mileage will vary. If you don't run multi-step pipelines, Rule 10 doesn't matter. If your codebase has one consistent style enforced by linting, Rule 11 is redundant. Read the 12, keep the ones that map to mistakes you've actually made, drop the rest.
A 6-rule CLAUDE.md tuned to your real failure modes beats a 12-rule one with 6 rules you'll never need.
The end
Karpathy's January 2026 thread was a complaint. Forrest Chang turned it into 4 rules. 120,000 developers starred the result. Most of them are still running 4 rules today.
The model has improved. The ecosystem has changed. Multi-step agents, hook cascades, skill loading, multi-codebase work — none of these existed when Karpathy wrote his thread. The 4 rules don't address them. They're not wrong; they're incomplete.
8 more rules. 6 weeks of testing across 30 codebases. Mistake rate from 41% to 3%.
*Bookmark this and paste the 12 rules into your CLAUDE.md tonight. Repost if it saved you a week of Claude wrong turns.*

Telegram for daily Claude optimization tips: https://t.me/+_ZWrQN7GuDA3ZDEy

