Recorded on 23 April 2026. The AI agent world is moving weekly, so treat this as a snapshot of my thinking at that point in time.
The short version: modern coding agents are not just “a model plus a prompt.” The interesting system is everything around the model: the harness, the context, the tools, the permissions, the memory, the loop, and the review process.
That is what I mean by agentic design.
The New System Design Question
A classic infrastructure interview question is:
What happens when you type google.com into a browser and press enter?
The agentic equivalent is now:
What happens when you type a prompt into a coding agent?
That question forces you to inspect the full stack:
- which model is selected
- what context is loaded
- what files and tools the agent can see
- how the plan is represented
- what the harness allows or denies
- how tool calls are executed
- how the loop decides to continue, retry, or stop
- what memory is updated after the run
- where human review enters the system
The model matters. But the model is no longer the whole product.
The Frontier Moved Outward
Most early conversations about agents treated prompt quality as the center of the system. That was reasonable when prompts, context windows, and raw model quality were the primary bottlenecks.
But once frontier models became good enough at longer-horizon work, the bottleneck moved outward.
The visible quality now comes from runtime design:
- context packing
- tool selection
- filesystem access
- terminal access
- permission gates
- stop conditions
- retry policy
- memory policy
- review workflow
This is why two tools using similar models can feel completely different. A better harness can look like a better model.
Prompt as a Subsystem
The old mental model was:
user prompt -> model answer
The practical model is closer to:
user intent
-> context loader
-> planning step
-> tool policy
-> filesystem / terminal / browser / APIs
-> execution loop
-> checkpoints
-> human review
-> memory update
-> next run
The prompt is still important. But it is one input to a larger machine.
If you want to understand better agents, inspect the machine.
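To make the machine inspectable, here is a minimal, runnable Python sketch of one run through that pipeline. Every component is a stub and every name is hypothetical; the point is the data flow and where the gates sit, not any real API.

```python
# Toy sketch of one agent run. The "planner" and "executor" are placeholder
# functions; memory is a dict that survives between runs.

ALLOWED_TOOLS = {"read_file", "run_tests"}  # permission gate, enforced outside the model

def plan(intent: str, context: dict) -> list[dict]:
    # Stand-in for the planning step; a real harness would call a model here.
    return [{"tool": "read_file", "arg": "README.md"},
            {"tool": "delete_branch", "arg": "main"},   # will be denied below
            {"tool": "run_tests", "arg": "tests/"}]

def execute(step: dict) -> str:
    # Stand-in for the execution surface (filesystem / terminal / APIs).
    return f"ran {step['tool']} on {step['arg']}"

def run_once(intent: str, memory: dict) -> dict:
    context = {"instructions": memory.get("instructions", ""), "intent": intent}
    transcript = []
    for step in plan(intent, context):
        if step["tool"] not in ALLOWED_TOOLS:   # the harness denies; the model never decides
            transcript.append(("denied", step))
            continue
        transcript.append((step, execute(step)))
    memory["last_run"] = transcript             # what survives into the next run
    return memory

memory = run_once("fix the failing test", {})
```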
Planning and Execution Are Different Jobs
One pattern I keep returning to is separating planning from execution.
For serious work, I like using stronger frontier models to generate plans. Give the same high-context prompt to multiple models, let them ask clarifying questions, and compare their plans. Then synthesize the useful parts into a checklist.
Once the plan is concrete enough, execution can often be handed to a faster model, a coding agent, or even a deterministic script.
The shape is:
- use strong models for taste, decomposition, and risk discovery
- convert the plan into a checklist
- execute against the checklist
- stop early when the agent diverges
- keep human review at the right boundaries
For low-stakes side projects, you can explore looser loops. For production or reputation-sensitive work, the review gates matter.
The useful split is three layers of control:
- model: reasoning, planning, synthesis, taste
- harness: tools, permissions, loop logic, execution surface
- context: files, memories, skills, plans, prior decisions
The pattern I like is to let strong models battle over the plan, then hand the selected plan to a stricter executor. The executor should not be “vibing.” It should be following a checklist.
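A sketch of that split, with `ask_model` as a placeholder for whatever model client you actually use; nothing here assumes a real API:

```python
# Hypothetical planner/executor split: strong models compete over the plan,
# a cheaper executor follows the resulting checklist step by step.

def ask_model(model: str, prompt: str) -> str:
    return f"[{model}] ok: {prompt[:40]}"      # canned stand-in response

def gather_plans(task: str, planners: list[str]) -> list[str]:
    # Let several strong models battle over the plan.
    return [ask_model(m, f"plan this, ask clarifying questions first: {task}") for m in planners]

def synthesize_checklist(plans: list[str]) -> list[str]:
    # In practice a human (or another model call) merges the useful parts.
    return ["reproduce the bug", "write a failing test", "apply the fix", "run the suite"]

def execute(checklist: list[str], executor: str) -> None:
    for item in checklist:
        reply = ask_model(executor, f"do exactly this and report 'ok': {item}")
        if "ok" not in reply:                  # crude divergence check: stop early
            print(f"executor diverged on {item!r}; stopping for review")
            return
        print(f"done: {item}")

plans = gather_plans("billing cron misses DST transitions", ["strong-a", "strong-b"])
execute(synthesize_checklist(plans), executor="fast-cheap-model")
```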
The Small Loop Pattern
A small loop captures the shift:
while :; do cat PROMPT.md | claude; done
That looks almost stupid. But add a persistent prompt, a checklist, a filesystem, a stop condition, and a review rule, and it becomes a real workflow.
Then the engineering questions appear:
- what does completion mean?
- what context should be refreshed?
- what should be pruned?
- when should the loop stop?
- what is safe for the model to execute?
- how do you detect drift?
- what state survives into the next run?
The loop is trivial. The surrounding runtime is the system.
I called this kind of wrapper the Ralph Wiggum loop: keep handing the agent the checklist, make it do the next thing, and stop only when a clear exit condition is hit. It is silly-looking engineering, but it works because it converts “one big agent run” into a repeated control loop.
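Here is the same loop in Python once a few of those questions get answers. The `claude` invocation and the DONE marker are assumptions, not a real CLI contract; substitute whatever agent command and completion convention you actually use.

```python
import subprocess
from pathlib import Path

MAX_RUNS = 20                    # hard ceiling: the loop must not run forever
DONE = "ALL TASKS DONE"          # assumed completion marker in the checklist

Path("runs").mkdir(exist_ok=True)
for run in range(MAX_RUNS):
    prompt = Path("PROMPT.md").read_text()                  # persistent prompt + checklist
    # Assumed agent CLI that reads the prompt on stdin; swap in your own command.
    result = subprocess.run(["claude"], input=prompt, text=True, capture_output=True)
    Path(f"runs/run-{run}.log").write_text(result.stdout)   # state that survives the run
    if DONE in Path("PROMPT.md").read_text():               # stop condition: checklist done
        break
    if result.returncode != 0:                              # failure or drift: stop for review
        break
```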
Harnesses Are the Product Surface
A useful harness usually includes:
- filesystem access
- terminal access
- tool calling
- permission policies
- context loading
- workflow templates
- memory boundaries
- background or scheduled execution
- feedback, retry, and backoff
- checkpoints and review gates
This is also where safety belongs.
Safety is not “tell the model to be careful.” Safety is enforced outside the model:
- read-only modes
- restricted tools
- sandboxing
- network boundaries
- permission prompts
- branch/checkpoint workflows
- human approval before destructive actions
The model can follow instructions better than before, but the safer default is still harness-level enforcement.
There is also a philosophical spectrum here. On one end is the slow, controlled, review-heavy style: tight permissions, careful diffs, small steps. On the other end is the token-billionaire style: broad autonomy, relaxed permissions, let the agent explore. Both modes have a place. The mistake is not knowing which mode you are in.
If production data, money, reputation, or irreversible state is involved, I want the boring harness-level controls.
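As a flavor of what harness-level enforcement looks like, here is a toy permission gate in Python. The allowlists and the approval rule are illustrative, not a recommended policy.

```python
import shlex

# A toy permission gate. The policy lives in the harness; the model never
# gets to decide whether a command is safe. All names here are illustrative.

ALLOWED = {"git", "grep", "ls", "cat", "pytest"}          # tool allowlist
READ_ONLY_SAFE = {"git", "grep", "ls", "cat"}             # allowed in read-only mode
NEEDS_APPROVAL = ("rm ", "git push", "terraform apply")   # human gate, prefix match

def gate(command: str, read_only: bool) -> str:
    if command.startswith(NEEDS_APPROVAL):
        return "escalate: human approval required"
    binary = shlex.split(command)[0]
    if binary not in ALLOWED:
        return "deny: not on the allowlist"
    if read_only and binary not in READ_ONLY_SAFE:
        return "deny: read-only mode"
    return "allow"

for cmd in ("grep -rn TODO src", "pytest -q", "rm -rf build", "git push origin main"):
    print(f"{cmd!r} -> {gate(cmd, read_only=True)}")
```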
Tools, Skills, MCPs, and Hooks
Reusable behavior should not live only in ad-hoc prompt blobs.
The useful surfaces are becoming more explicit:
- SKILL.md files
- CLAUDE.md / AGENTS.md-style project instructions
- MCP servers
- hooks
- slash commands
- custom scripts
- small local tools
SKILL.md is interesting because it packages procedure as context. It lets the agent load the right operational knowledge when a task appears.
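For concreteness, here is a sketch of what such a file can look like. The frontmatter fields and the loading behavior vary by harness, so treat the exact shape below as an assumption rather than a spec.

```markdown
---
name: release-notes
description: How to draft release notes for this repo. Load when preparing a release.
---

# Drafting release notes

1. Collect changes: `git log --oneline vPREV..HEAD`.
2. Group them under Added / Changed / Fixed.
3. Match the tone of the last three entries in CHANGELOG.md.
4. Never mention internal ticket IDs.
```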
MCPs are interesting because they turn external systems into agent-usable tools. A lot of enterprise APIs are unpleasant for humans; an MCP wrapper can make them tolerable for agents.
But MCPs are not always the best local interface. They can also pollute context if the tool surface is too large. For local work, a small SKILL.md, a direct CLI, or a tiny script can be more effective than a large protocol wrapper.
This is also why “return to primitives” keeps showing up. Modern agents often do well with boring terminal primitives: grep, find, python, jq, git, curl, small scripts. A clean primitive with clear output can beat a fancy integration with too much context.
Hooks are interesting because they move repeated rules out of the model and into the runtime. Instead of repeatedly saying “remember to run tests,” the harness can run tests. Instead of asking the model to update docs every time, a hook can enforce that flow.
The general rule: if something is deterministic, prefer a script or hook over another paragraph of instruction.
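For example, "remember to run tests" as a hook rather than an instruction. This sketch assumes a harness that runs an executable after the agent edits files and treats a non-zero exit as "block and report back"; the exact contract varies by tool.

```python
#!/usr/bin/env python3
# Hypothetical post-edit hook. The rule "run the tests" now lives in the
# runtime, not in another paragraph of instructions.
import subprocess
import sys

result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
if result.returncode != 0:
    print(result.stdout[-2000:], file=sys.stderr)  # last chunk of output, fed back to the agent
    sys.exit(1)                                    # non-zero means: fix this before continuing
```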
Memory Is Not Just Retrieval
For a long time, AI memory meant embeddings plus retrieval. That is too narrow.
In agent systems, memory includes:
- durable project instructions
- user preferences
- prior decisions
- tool quirks
- failure notes
- summaries of previous runs
- source-of-truth documents
- explicit “do not repeat this mistake” rules
The hard part is not storing more. The hard part is deciding what deserves to survive.
Bad memory becomes stale-rule debt. Good memory removes repeated steering.
Personal agents make this more important. If an agent is always available through WhatsApp, Telegram, iMessage, a terminal, or a browser, memory becomes part of the product. But it needs curation. Otherwise the agent slowly fills itself with slop.
The shape I like more than generic RAG is an LLM wiki: a hierarchy of markdown files that the agent can read, maintain, summarize, and improve over time. As context windows grow, the bottleneck is less “can I retrieve five chunks?” and more “what is the authoritative structure of what this agent knows?”
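A minimal sketch of that curation boundary, assuming a wiki/ directory of markdown files and a staging file the agent may append to but never promote from; all paths and conventions are illustrative.

```python
from pathlib import Path

WIKI = Path("wiki")                  # hierarchy of markdown files the agent reads
PROPOSALS = WIKI / "PROPOSED.md"     # staging area, never loaded as authority

def propose_memory(note: str, source_run: str) -> None:
    # The agent's only write path: append a candidate memory for review.
    PROPOSALS.parent.mkdir(exist_ok=True)
    with PROPOSALS.open("a") as f:
        f.write(f"- {note} (from {source_run})\n")

def promote(line: str, target: str) -> None:
    # Human-invoked: move an accepted note into the real wiki page.
    with (WIKI / target).open("a") as f:
        f.write(line + "\n")

propose_memory("the staging DB user is read-only; do not attempt migrations there",
               source_run="runs/run-7.log")
```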
Agent Boards and Multi-Agent Workflows
Multi-agent tools are starting to look less like chat and more like project boards.
The pattern is:
- split work into tickets
- assign agents to independent tasks
- run some tasks in parallel
- serialize dependent tasks
- merge outputs through review
- keep a human in the loop for taste and acceptance
Tools like Paperclip, Mission Control, and Symphony point in this direction: agent work allocated across boards, roles, and dependency graphs.
The risk is obvious: this can produce unprecedented amounts of AI slop.
The hard part is not spawning more agents. The hard part is defining ownership, convergence, artifacts, and review gates.
Parallelism Needs Dependency Design
Multi-agent work starts to look like infrastructure planning.
Some tasks can run independently. Some must wait. Some require human approval. Some should never run automatically.
It is similar to Terraform in spirit: independent resources can apply in parallel, dependent resources must serialize, and dangerous changes need a plan/review step.
Agent boards need the same discipline.
Without dependency design, parallelism becomes chaos.
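Here is what that discipline can look like in miniature, using Python's standard-library graphlib. The ticket names and the approval set are made up; the shape is the point: independent tickets run as parallel waves, dependents wait, and gated tickets never run automatically.

```python
from graphlib import TopologicalSorter

deps = {
    "write-schema": set(),
    "write-api": {"write-schema"},
    "write-ui": {"write-schema"},
    "integration-tests": {"write-api", "write-ui"},
    "deploy": {"integration-tests"},
}
NEEDS_HUMAN = {"deploy"}   # never runs automatically

ts = TopologicalSorter(deps)
ts.prepare()
held = []
while ts.is_active():
    ready = list(ts.get_ready())
    held += [t for t in ready if t in NEEDS_HUMAN]
    runnable = [t for t in ready if t not in NEEDS_HUMAN]
    if not runnable:
        break
    print("parallel wave:", runnable)   # dispatch these to agents concurrently
    for t in runnable:
        ts.done(t)                      # pretend the agents finished
print("held for human approval:", held)
```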
From Prompt Templates to Just-in-Time Software
Prompt templates and slash commands are useful. They encode repeated workflows and help non-technical users get consistent behavior.
But there is a better pattern when the task is repeatable:
Ask the agent to build software it can use later.
A script, extension, hook, or CLI can beat a long instruction file because it makes the behavior deterministic. This is especially true for repeated operations: parsing, formatting, checking, publishing, scraping, moving files, generating reports, running tests.
The best agentic workflow is not always “write a better prompt.” Often it is:
generate a small tool, then use the tool.
That is just-in-time software.
This is one of the strongest patterns in the whole talk. Do not keep adding prompt mass forever. If the agent repeatedly needs to do something precise, ask it to write a deterministic tool. Then future runs can call the tool instead of reinterpreting the instruction.
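As a concrete instance, here is the kind of tool an agent might generate once and then reuse on every future run; the rule it enforces (every CHANGELOG heading starts with a date) is invented for illustration.

```python
#!/usr/bin/env python3
# A small, deterministic checker: same input, same verdict, every run.
# Calling this beats reinterpreting a paragraph of formatting instructions.
import re
import sys
from pathlib import Path

PATTERN = re.compile(r"^## \d{4}-\d{2}-\d{2} ")

bad = [i for i, line in enumerate(Path("CHANGELOG.md").read_text().splitlines(), 1)
       if line.startswith("## ") and not PATTERN.match(line)]
if bad:
    print(f"entries missing dates on lines: {bad}")
    sys.exit(1)
```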
Auto-Improving Context
The next layer is self-improvement loops.
A simple version:
- inspect previous agent sessions
- find repeated failures
- propose updates to CLAUDE.md, AGENTS.md, or SKILL.md
- human reviews the changes
- future runs improve
This is where agent systems start to feel like they are learning operationally, not because the model weights changed, but because the surrounding context and tooling improved.
Karpathy’s auto-research framing points in the same direction: define an objective, run experiments, update the loop by evidence.
The harness improves even when the model stays the same.
Always-on personal agents push this further. A background agent can run scheduled research, inspect previous conversations, draft memory updates, and improve its own operating context. I think of this as memory dreaming: the agent reviews what happened, proposes what should be remembered, and the human accepts or rejects it.
The human review part matters. Autonomous memory without curation becomes instruction garbage collection in reverse.
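A toy version of that dreaming pass, assuming run logs with ERROR: lines and a proposals file as the review gate; the log format and file names are assumptions.

```python
from collections import Counter
from pathlib import Path

# Scan previous run logs for repeated failures and draft context updates
# for a human to accept or reject. Nothing lands in CLAUDE.md automatically.

failures = Counter()
for log in Path("runs").glob("run-*.log"):
    for line in log.read_text().splitlines():
        if line.startswith("ERROR:"):
            failures[line] += 1

proposals = [f"- Add to CLAUDE.md: avoid this repeated failure -> {msg!r} ({n} runs)"
             for msg, n in failures.items() if n >= 3]
Path("PROPOSED_CONTEXT_UPDATES.md").write_text("\n".join(proposals) + "\n")
```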
Why Dismissing Tools Too Early Is Dangerous
One point my mentor drew out in the conversation is important: it is easy to dismiss new AI tools as fads, slop, or unnecessary.
Sometimes that dismissal is correct.
But if the dismissal comes from a shallow gloss rather than a serious trial, you may miss the genuinely useful part of the tool.
The better posture is empirical:
- try it seriously
- inspect what it changes
- find the failure modes
- keep the useful primitive
- discard the hype
Keeping up with the agentic toolchain is almost a full-time job. But if you build software for a living, the primitives are becoming too important to ignore.
Also, because the conversation got informal by the end: yes, this includes the important cultural work of appreciating RAG Against the Machine and Seedhe Maut. Some primitives are technical. Some are motivational infrastructure.
What Engineers Are Actually Engineering Now
The engineering target is shifting.
We are not only engineering application code. We are engineering:
- workflows
- flowcharts
- harnesses
- permissions
- control loops
- memory systems
- review gates
- background jobs
- agent-to-agent handoffs
- context maintenance
That is the real frontier of agentic design.
The frontier is no longer only the model frontier.
It is the design frontier:
- can the loop preserve context?
- can it protect itself?
- can it recover?
- can it improve across runs?
- can it produce useful artifacts without drowning the human in slop?
The prompt matters, but only inside that architecture.