Six tiers of automation, animated by tiny pixel agents. Click a rung to watch how the work flows. The further you climb, the more autonomy you hand over — and the more eval surface you take on.
Every step up adds power and new failure modes; the rule is to start at the bottom and climb only when the data forces you to.
Same task: "respond to a customer asking about their order." On the left, you write the steps and the LLM fills in the blanks. On the right, the agent gets a goal plus tools and figures out the path itself. Watch them go.
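The contrast is a control-flow question, and it fits in a few lines. This is a minimal sketch, not a real implementation: `llm()` is a stub keyed on the prompt text so the example runs without a model, and every name in it is hypothetical.

```python
# A minimal sketch of the contrast, with llm() stubbed out so the control
# flow is visible without a real model. All names here are hypothetical.

def llm(prompt: str) -> str:
    """Stand-in for a real model call, keyed on the prompt text."""
    if "shipped" in prompt:                  # answer already gathered
        return "DONE"
    if "check_order" in prompt:              # a tool is on offer: pick it
        return "check_order"
    return "shipped, arriving Tuesday"       # fill in a blank

# Workflow: the human wrote every step; the model only fills in blanks.
def workflow_reply(order_id: str) -> str:
    status = llm(f"Summarize the status of order {order_id}")  # fixed step 1
    return f"Hi! Your order {order_id} is {status}."           # fixed step 2

# Agent: given a goal and tools, the model picks each next action itself.
def agent_reply(goal: str, tools: dict) -> str:
    notes = goal
    for _ in range(5):                       # hard cap on autonomy
        choice = llm(f"{notes}\nTools: {', '.join(tools)}\nPick one or say DONE")
        if choice == "DONE":
            break
        notes += "\n" + tools[choice]()      # execute whatever it chose
    return notes

tools = {"check_order": lambda: "Order A123: shipped, arriving Tuesday"}
print(workflow_reply("A123"))
print(agent_reply("Answer the customer asking about order A123", tools))
```

Note the design difference: the workflow's failure surface is the blanks; the agent's failure surface is the whole loop, which is why the agent version needs the hard iteration cap even in a toy.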
Customer support is the canonical case. Internal users forgive a wrong answer; external customers don't. Toggle the audience and watch the workflow's checkpoints rearrange themselves.
Every automation that survives contact with reality has the same four-stage rhythm. Agents move around the ring. Failures from the field become tomorrow's eval cases.
Map the workflow. Score each task by volume × time × error cost. Pick the 1–2 highest-leverage tasks.
Hand-build v0 in Claude. Run 20–50 real examples. Write the eval rubric BEFORE building.
Smallest scope that ships value. Ship behind a flag or HITL. Log everything: success rate, latency, cost, override rate.
Weekly: read failures, add to eval set, iterate prompt + tools, re-run evals. Loosen autonomy only when metrics earn it.
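The scoring in the map step is just arithmetic, and it is worth making explicit. A sketch, with an invented task inventory — the names and numbers below are illustrative assumptions, not data:

```python
# Hypothetical task inventory for the mapping step; all numbers are made up.
# Leverage score = weekly volume x minutes per task x cost of an error.
tasks = [
    {"name": "order-status replies", "per_week": 400, "minutes": 4,  "error_cost": 1},
    {"name": "refund approvals",     "per_week": 60,  "minutes": 10, "error_cost": 8},
    {"name": "FAQ drafting",         "per_week": 20,  "minutes": 30, "error_cost": 2},
]

def leverage(task: dict) -> int:
    """volume x time x error cost, per the mapping heuristic."""
    return task["per_week"] * task["minutes"] * task["error_cost"]

# Pick the 1-2 highest-leverage tasks to automate first.
top_two = sorted(tasks, key=leverage, reverse=True)[:2]
for t in top_two:
    print(t["name"], leverage(t))
```

With these toy numbers, refund approvals (60 × 10 × 8 = 4800) beat the much higher-volume order-status replies (400 × 4 × 1 = 1600): error cost is a multiplier, not a tiebreaker.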
Karpathy's framing: programming has shifted from writing code to writing prompts. The context window is your RAM. The model is the CPU. Your specs are the program. Here's what changes for PMs — and what to actually do about it.
Each era keeps the previous one — you still write code, you still train models. But the leading edge of leverage moves up the stack.
Explicit rules. The human writes every branch. Predictable, debuggable, brittle when the world doesn't fit the rules.
Neural networks. You curate datasets and let gradient descent find the program for you. The "code" is now millions of weights.
The LLM is the computer. You program it in English. The context window is RAM. Spec quality, not syntax, is the bottleneck.
Two different speeds. Both are useful. Knowing which one you're in is the PM job.
Models peak where output is verifiable (math, code) because that's where RL has been pointed. They fall off a cliff in tasks that look "obvious" to humans. Treat capability as terrain, not altitude.
Don't blindly trust the output. A model that just refactored a 100k-line codebase may also fail to decide whether to walk or drive 50 metres to a car wash. Stay in the loop. Eval the domain, not the model.
As intelligence gets cheaper, the premium moves to taste, judgment, and oversight. The PMs who win in Software 3.0 are the ones who can write a sharp spec and review agent output ruthlessly.
Outsource the thinking — data crunching, code synthesis, draft generation. Don't outsource the understanding. Know your business, know your users, know what "good" looks like. The agents fill in the API details. You decide what's worth building and whether it works.
Concrete moves Karpathy calls out. Steal them.
Stop testing syntax puzzles. Hand candidates a real-scale spec — "build a secure social-media clone" — and watch how they decompose the problem and orchestrate agents to ship it.
Your role is Director. Spend your time on engineering design, architecture decisions, and detailed specs. Let the agents fill in API details, boilerplate, and tests.
You can outsource the thinking — data crunching, code synthesis. You can't outsource the understanding. Go deep on the business, the users, and the technical fundamentals you're directing.