Six tiers of automation, animated by tiny pixel agents. Click a rung to watch how the work flows. The further you climb, the more autonomy you hand over — and the more eval surface you take on.
Every step up adds power and new failure modes; the rule is to start at the bottom and climb only when the data forces you to.
Same task: "respond to a customer asking about their order." On the left, you write the steps and the LLM fills in the blanks. On the right, the agent gets a goal plus tools and figures out the path itself. Watch them go.
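The contrast is a control-flow question, and it fits in a few lines. This is a minimal sketch, not a real implementation: `llm()` is a stub keyed on the prompt text so the example runs without a model, and every name in it is hypothetical.

```python
# A minimal sketch of the contrast, with llm() stubbed out so the control
# flow is visible without a real model. All names here are hypothetical.

def llm(prompt: str) -> str:
    """Stand-in for a real model call, keyed on the prompt text."""
    if "shipped" in prompt:                  # answer already gathered
        return "DONE"
    if "check_order" in prompt:              # a tool is on offer: pick it
        return "check_order"
    return "shipped, arriving Tuesday"       # fill in a blank

# Workflow: the human wrote every step; the model only fills in blanks.
def workflow_reply(order_id: str) -> str:
    status = llm(f"Summarize the status of order {order_id}")  # fixed step 1
    return f"Hi! Your order {order_id} is {status}."           # fixed step 2

# Agent: given a goal and tools, the model picks each next action itself.
def agent_reply(goal: str, tools: dict) -> str:
    notes = goal
    for _ in range(5):                       # hard cap on autonomy
        choice = llm(f"{notes}\nTools: {', '.join(tools)}\nPick one or say DONE")
        if choice == "DONE":
            break
        notes += "\n" + tools[choice]()      # execute whatever it chose
    return notes

tools = {"check_order": lambda: "Order A123: shipped, arriving Tuesday"}
print(workflow_reply("A123"))
print(agent_reply("Answer the customer asking about order A123", tools))
```

Note the design difference: the workflow's failure surface is the blanks; the agent's failure surface is the whole loop, which is why the agent version needs the hard iteration cap even in a toy.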
Customer support is the canonical case. Internal users forgive a wrong answer; external customers don't. Toggle the audience and watch the workflow's checkpoints rearrange themselves.
Every automation that survives contact with reality has the same four-stage rhythm. Agents move around the ring. Failures from the field become tomorrow's eval cases.
Map the workflow. Score each task by volume × time × error cost. Pick the 1–2 highest-leverage tasks.
Hand-build v0 in Claude. Run 20–50 real examples. Write the eval rubric BEFORE building.
Smallest scope that ships value. Ship behind a flag or HITL. Log everything: success rate, latency, cost, override rate.
Weekly: read failures, add to eval set, iterate prompt + tools, re-run evals. Loosen autonomy only when metrics earn it.
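The scoring in the map step is just arithmetic, and it is worth making explicit. A sketch, with an invented task inventory — the names and numbers below are illustrative assumptions, not data:

```python
# Hypothetical task inventory for the mapping step; all numbers are made up.
# Leverage score = weekly volume x minutes per task x cost of an error.
tasks = [
    {"name": "order-status replies", "per_week": 400, "minutes": 4,  "error_cost": 1},
    {"name": "refund approvals",     "per_week": 60,  "minutes": 10, "error_cost": 8},
    {"name": "FAQ drafting",         "per_week": 20,  "minutes": 30, "error_cost": 2},
]

def leverage(task: dict) -> int:
    """volume x time x error cost, per the mapping heuristic."""
    return task["per_week"] * task["minutes"] * task["error_cost"]

# Pick the 1-2 highest-leverage tasks to automate first.
top_two = sorted(tasks, key=leverage, reverse=True)[:2]
for t in top_two:
    print(t["name"], leverage(t))
```

With these toy numbers, refund approvals (60 × 10 × 8 = 4800) beat the much higher-volume order-status replies (400 × 4 × 1 = 1600): error cost is a multiplier, not a tiebreaker.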
Karpathy's framing: programming has shifted from writing code to writing prompts. The context window is your RAM. The model is the CPU. Your specs are the program. Here's what changes for PMs — and what to actually do about it.
Each era keeps the previous one — you still write code, you still train models. But the leading edge of leverage moves up the stack.
Explicit rules. The human writes every branch. Predictable, debuggable, brittle when the world doesn't fit the rules.
Neural networks. You curate datasets and let gradient descent find the program for you. The "code" is now millions of weights.
The LLM is the computer. You program it in English. The context window is RAM. Spec quality, not syntax, is the bottleneck.
Two different speeds. Both are useful. Knowing which one you're in is the PM job.
Models peak where output is verifiable (math, code) because that's where RL has been pointed. They fall off a cliff in tasks that look "obvious" to humans. Treat capability as terrain, not altitude.
Don't blindly trust the output. A model that just refactored a 100k-line codebase may also fail to decide whether to walk or drive 50 metres to a car wash. Stay in the loop. Eval the domain, not the model.
As intelligence gets cheaper, the premium moves to taste, judgment, and oversight. The PMs who win in Software 3.0 are the ones who can write a sharp spec and review agent output ruthlessly.
Outsource the thinking — data crunching, code synthesis, draft generation. Don't outsource the understanding. Know your business, know your users, know what "good" looks like. The agents fill in the API details. You decide what's worth building and whether it works.
Concrete moves Karpathy calls out. Steal them.
Stop testing syntax puzzles. Hand candidates a real-scale spec — "build a secure social-media clone" — and watch how they decompose the problem and orchestrate agents to ship it.
Your role is Director. Spend your time on engineering design, architecture decisions, and detailed specs. Let the agents fill in API details, boilerplate, and tests.
You can outsource the thinking — data crunching, code synthesis. You can't outsource the understanding. Go deep on the business, the users, and the technical fundamentals you're directing.