Recorded on 23 April 2026. The AI agent world is moving weekly, so treat this as a snapshot of my thinking at that point in time.
The short version: modern coding agents are not just “a model plus a prompt.” The interesting system is everything around the model: the harness, the context, the tools, the permissions, the memory, the loop, and the review process.
That is what I mean by agentic design.
The New System Design Question
A classic infrastructure interview question is:
What happens when you type google.com into a browser and press enter?
The agentic equivalent is now:
What happens when you type a prompt into a coding agent?
That question forces you to inspect the full stack:
- which model is selected
- what context is loaded
- what files and tools the agent can see
- how the plan is represented
- what the harness allows or denies
- how tool calls are executed
- how the loop decides to continue, retry, or stop
- what memory is updated after the run
- where human review enters the system
The model matters. But the model is no longer the whole product.
The Frontier Moved Outward
Most early conversations about agents treated prompt quality as the center of the system. That was reasonable when prompts, context windows, and raw model quality were the primary bottlenecks.
But once frontier models became good enough at longer-horizon work, the bottleneck moved outward.
The visible quality now comes from runtime design:
- context packing
- tool selection
- filesystem access
- terminal access
- permission gates
- stop conditions
- retry policy
- memory policy
- review workflow
This is why two tools using similar models can feel completely different. A better harness can look like a better model.
Prompt as a Subsystem
The old mental model was:
user prompt -> model answer
The practical model is closer to:
user intent
-> context loader
-> planning step
-> tool policy
-> filesystem / terminal / browser / APIs
-> execution loop
-> checkpoints
-> human review
-> memory update
-> next run
The prompt is still important. But it is one input to a larger machine.
If you want to understand better agents, inspect the machine.
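To make the machine inspectable, here is a minimal, runnable Python sketch of one run through that pipeline. Every component is a stub and every name is hypothetical; the point is the data flow and where the gates sit, not any real API.

```python
# Toy sketch of one agent run. The "planner" and "executor" are placeholder
# functions; memory is a dict that survives between runs.

ALLOWED_TOOLS = {"read_file", "run_tests"}  # permission gate, enforced outside the model

def plan(intent: str, context: dict) -> list[dict]:
    # Stand-in for the planning step; a real harness would call a model here.
    return [{"tool": "read_file", "arg": "README.md"},
            {"tool": "delete_branch", "arg": "main"},   # will be denied below
            {"tool": "run_tests", "arg": "tests/"}]

def execute(step: dict) -> str:
    # Stand-in for the execution surface (filesystem / terminal / APIs).
    return f"ran {step['tool']} on {step['arg']}"

def run_once(intent: str, memory: dict) -> dict:
    context = {"instructions": memory.get("instructions", ""), "intent": intent}
    transcript = []
    for step in plan(intent, context):
        if step["tool"] not in ALLOWED_TOOLS:   # the harness denies; the model never decides
            transcript.append(("denied", step))
            continue
        transcript.append((step, execute(step)))
    memory["last_run"] = transcript             # what survives into the next run
    return memory

memory = run_once("fix the failing test", {})
```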
Planning and Execution Are Different Jobs
One pattern I keep returning to is separating planning from execution.
For serious work, I like using stronger frontier models to generate plans. Give the same high-context prompt to multiple models, let them ask clarifying questions, and compare their plans. Then synthesize the useful parts into a checklist.
Once the plan is concrete enough, execution can often be handed to a faster model, a coding agent, or even a deterministic script.
The shape is:
- use strong models for taste, decomposition, and risk discovery
- convert the plan into a checklist
- execute against the checklist
- stop early when the agent diverges
- keep human review at the right boundaries
For low-stakes side projects, you can explore looser loops. For production or reputation-sensitive work, the review gates matter.
The useful split is three layers of control:
- model: reasoning, planning, synthesis, taste
- harness: tools, permissions, loop logic, execution surface
- context: files, memories, skills, plans, prior decisions
The pattern I like is to let strong models battle over the plan, then hand the selected plan to a stricter executor. The executor should not be “vibing.” It should be following a checklist.
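A sketch of that split, with `ask_model` as a placeholder for whatever model client you actually use; nothing here assumes a real API:

```python
# Hypothetical planner/executor split: strong models compete over the plan,
# a cheaper executor follows the resulting checklist step by step.

def ask_model(model: str, prompt: str) -> str:
    return f"[{model}] ok: {prompt[:40]}"      # canned stand-in response

def gather_plans(task: str, planners: list[str]) -> list[str]:
    # Let several strong models battle over the plan.
    return [ask_model(m, f"plan this, ask clarifying questions first: {task}") for m in planners]

def synthesize_checklist(plans: list[str]) -> list[str]:
    # In practice a human (or another model call) merges the useful parts.
    return ["reproduce the bug", "write a failing test", "apply the fix", "run the suite"]

def execute(checklist: list[str], executor: str) -> None:
    for item in checklist:
        reply = ask_model(executor, f"do exactly this and report 'ok': {item}")
        if "ok" not in reply:                  # crude divergence check: stop early
            print(f"executor diverged on {item!r}; stopping for review")
            return
        print(f"done: {item}")

plans = gather_plans("billing cron misses DST transitions", ["strong-a", "strong-b"])
execute(synthesize_checklist(plans), executor="fast-cheap-model")
```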
The Small Loop Pattern
A small loop captures the shift:
while :; do cat PROMPT.md | claude; done
That looks almost stupid. But add a persistent prompt, a checklist, a filesystem, a stop condition, and a review rule, and it becomes a real workflow.
Then the engineering questions appear:
- what does completion mean?
- what context should be refreshed?
- what should be pruned?
- when should the loop stop?
- what is safe for the model to execute?
- how do you detect drift?
- what state survives into the next run?
The loop is trivial. The surrounding runtime is the system.
I called this kind of wrapper the Ralph Wiggum loop: keep handing the agent the checklist, make it do the next thing, and stop only when a clear exit condition is hit. It is silly-looking engineering, but it works because it converts “one big agent run” into a repeated control loop.
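Here is the same loop in Python once a few of those questions get answers. The `claude` invocation and the DONE marker are assumptions, not a real CLI contract; substitute whatever agent command and completion convention you actually use.

```python
import subprocess
from pathlib import Path

MAX_RUNS = 20                    # hard ceiling: the loop must not run forever
DONE = "ALL TASKS DONE"          # assumed completion marker in the checklist

Path("runs").mkdir(exist_ok=True)
for run in range(MAX_RUNS):
    prompt = Path("PROMPT.md").read_text()                  # persistent prompt + checklist
    # Assumed agent CLI that reads the prompt on stdin; swap in your own command.
    result = subprocess.run(["claude"], input=prompt, text=True, capture_output=True)
    Path(f"runs/run-{run}.log").write_text(result.stdout)   # state that survives the run
    if DONE in Path("PROMPT.md").read_text():               # stop condition: checklist done
        break
    if result.returncode != 0:                              # failure or drift: stop for review
        break
```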
Harnesses Are the Product Surface
A useful harness usually includes:
- filesystem access
- terminal access
- tool calling
- permission policies
- context loading
- workflow templates
- memory boundaries
- background or scheduled execution
- feedback, retry, and backoff
- checkpoints and review gates
This is also where safety belongs.
Safety is not “tell the model to be careful.” Safety is enforced outside the model:
- read-only modes
- restricted tools
- sandboxing
- network boundaries
- permission prompts
- branch/checkpoint workflows
- human approval before destructive actions
The model can follow instructions better than before, but the safer default is still harness-level enforcement.
There is also a philosophical spectrum here. On one end is the slow, controlled, review-heavy style: tight permissions, careful diffs, small steps. On the other end is the token-billionaire style: broad autonomy, relaxed permissions, let the agent explore. Both modes have a place. The mistake is not knowing which mode you are in.
If production data, money, reputation, or irreversible state is involved, I want the boring harness-level controls.
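As a flavor of what harness-level enforcement looks like, here is a toy permission gate in Python. The allowlists and the approval rule are illustrative, not a recommended policy.

```python
import shlex

# A toy permission gate. The policy lives in the harness; the model never
# gets to decide whether a command is safe. All names here are illustrative.

ALLOWED = {"git", "grep", "ls", "cat", "pytest"}          # tool allowlist
READ_ONLY_SAFE = {"git", "grep", "ls", "cat"}             # allowed in read-only mode
NEEDS_APPROVAL = ("rm ", "git push", "terraform apply")   # human gate, prefix match

def gate(command: str, read_only: bool) -> str:
    if command.startswith(NEEDS_APPROVAL):
        return "escalate: human approval required"
    binary = shlex.split(command)[0]
    if binary not in ALLOWED:
        return "deny: not on the allowlist"
    if read_only and binary not in READ_ONLY_SAFE:
        return "deny: read-only mode"
    return "allow"

for cmd in ("grep -rn TODO src", "pytest -q", "rm -rf build", "git push origin main"):
    print(f"{cmd!r} -> {gate(cmd, read_only=True)}")
```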
Tools, Skills, MCPs, and Hooks
Reusable behavior should not live only in ad-hoc prompt blobs.
The useful surfaces are becoming more explicit:
- SKILL.md files
- CLAUDE.md / AGENTS.md-style project instructions
- MCP servers
- hooks
- slash commands
- custom scripts
- small local tools
SKILL.md is interesting because it packages procedure as context. It lets the agent load the right operational knowledge when a task appears.
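For concreteness, here is a sketch of what such a file can look like. The frontmatter fields and the loading behavior vary by harness, so treat the exact shape below as an assumption rather than a spec.

```markdown
---
name: release-notes
description: How to draft release notes for this repo. Load when preparing a release.
---

# Drafting release notes

1. Collect changes: `git log --oneline vPREV..HEAD`.
2. Group them under Added / Changed / Fixed.
3. Match the tone of the last three entries in CHANGELOG.md.
4. Never mention internal ticket IDs.
```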
MCPs are interesting because they turn external systems into agent-usable tools. A lot of enterprise APIs are unpleasant for humans; an MCP wrapper can make them tolerable for agents.
But MCPs are not always the best local interface. They can also pollute context if the tool surface is too large. For local work, a small SKILL.md, a direct CLI, or a tiny script can be more effective than a large protocol wrapper.
This is also why “return to primitives” keeps showing up. Modern agents often do well with boring terminal primitives: grep, find, python, jq, git, curl, small scripts. A clean primitive with clear output can beat a fancy integration with too much context.
Hooks are interesting because they move repeated rules out of the model and into the runtime. Instead of repeatedly saying “remember to run tests,” the harness can run tests. Instead of asking the model to update docs every time, a hook can enforce that flow.
The general rule: if something is deterministic, prefer a script or hook over another paragraph of instruction.
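For example, "remember to run tests" as a hook rather than an instruction. This sketch assumes a harness that runs an executable after the agent edits files and treats a non-zero exit as "block and report back"; the exact contract varies by tool.

```python
#!/usr/bin/env python3
# Hypothetical post-edit hook. The rule "run the tests" now lives in the
# runtime, not in another paragraph of instructions.
import subprocess
import sys

result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
if result.returncode != 0:
    print(result.stdout[-2000:], file=sys.stderr)  # last chunk of output, fed back to the agent
    sys.exit(1)                                    # non-zero means: fix this before continuing
```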
Memory Is Not Just Retrieval
For a long time, AI memory meant embeddings plus retrieval. That is too narrow.
In agent systems, memory includes:
- durable project instructions
- user preferences
- prior decisions
- tool quirks
- failure notes
- summaries of previous runs
- source-of-truth documents
- explicit “do not repeat this mistake” rules
The hard part is not storing more. The hard part is deciding what deserves to survive.
Bad memory becomes stale-rule debt. Good memory removes repeated steering.
Personal agents make this more important. If an agent is always available through WhatsApp, Telegram, iMessage, a terminal, or a browser, memory becomes part of the product. But it needs curation. Otherwise the agent slowly fills itself with slop.
The shape I like more than generic RAG is an LLM wiki: a hierarchy of markdown files that the agent can read, maintain, summarize, and improve over time. As context windows grow, the bottleneck is less “can I retrieve five chunks?” and more “what is the authoritative structure of what this agent knows?”
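A minimal sketch of that curation boundary, assuming a wiki/ directory of markdown files and a staging file the agent may append to but never promote from; all paths and conventions are illustrative.

```python
from pathlib import Path

WIKI = Path("wiki")                  # hierarchy of markdown files the agent reads
PROPOSALS = WIKI / "PROPOSED.md"     # staging area, never loaded as authority

def propose_memory(note: str, source_run: str) -> None:
    # The agent's only write path: append a candidate memory for review.
    PROPOSALS.parent.mkdir(exist_ok=True)
    with PROPOSALS.open("a") as f:
        f.write(f"- {note} (from {source_run})\n")

def promote(line: str, target: str) -> None:
    # Human-invoked: move an accepted note into the real wiki page.
    with (WIKI / target).open("a") as f:
        f.write(line + "\n")

propose_memory("the staging DB user is read-only; do not attempt migrations there",
               source_run="runs/run-7.log")
```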
Agent Boards and Multi-Agent Workflows
Multi-agent tools are starting to look less like chat and more like project boards.
The pattern is:
- split work into tickets
- assign agents to independent tasks
- run some tasks in parallel
- serialize dependent tasks
- merge outputs through review
- keep a human in the loop for taste and acceptance
Tools like Paperclip, Mission Control, and Symphony point in this direction: agent work allocated across boards, roles, and dependency graphs.
The risk is obvious: this can produce unprecedented amounts of AI slop.
The hard part is not spawning more agents. The hard part is defining ownership, convergence, artifacts, and review gates.
Parallelism Needs Dependency Design
Multi-agent work starts to look like infrastructure planning.
Some tasks can run independently. Some must wait. Some require human approval. Some should never run automatically.
It is similar to Terraform in spirit: independent resources can apply in parallel, dependent resources must serialize, and dangerous changes need a plan/review step.
Agent boards need the same discipline.
Without dependency design, parallelism becomes chaos.
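Here is what that discipline can look like in miniature, using Python's standard-library graphlib. The ticket names and the approval set are made up; the shape is the point: independent tickets run as parallel waves, dependents wait, and gated tickets never run automatically.

```python
from graphlib import TopologicalSorter

deps = {
    "write-schema": set(),
    "write-api": {"write-schema"},
    "write-ui": {"write-schema"},
    "integration-tests": {"write-api", "write-ui"},
    "deploy": {"integration-tests"},
}
NEEDS_HUMAN = {"deploy"}   # never runs automatically

ts = TopologicalSorter(deps)
ts.prepare()
held = []
while ts.is_active():
    ready = list(ts.get_ready())
    held += [t for t in ready if t in NEEDS_HUMAN]
    runnable = [t for t in ready if t not in NEEDS_HUMAN]
    if not runnable:
        break
    print("parallel wave:", runnable)   # dispatch these to agents concurrently
    for t in runnable:
        ts.done(t)                      # pretend the agents finished
print("held for human approval:", held)
```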
From Prompt Templates to Just-in-Time Software
Prompt templates and slash commands are useful. They encode repeated workflows and help non-technical users get consistent behavior.
But there is a better pattern when the task is repeatable:
Ask the agent to build software it can use later.
A script, extension, hook, or CLI can beat a long instruction file because it makes the behavior deterministic. This is especially true for repeated operations: parsing, formatting, checking, publishing, scraping, moving files, generating reports, running tests.
The best agentic workflow is not always “write a better prompt.” Often it is:
generate a small tool, then use the tool.
That is just-in-time software.
This is one of the strongest patterns in the whole talk. Do not keep adding prompt mass forever. If the agent repeatedly needs to do something precise, ask it to write a deterministic tool. Then future runs can call the tool instead of reinterpreting the instruction.
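As a concrete instance, here is the kind of tool an agent might generate once and then reuse on every future run; the rule it enforces (every CHANGELOG heading starts with a date) is invented for illustration.

```python
#!/usr/bin/env python3
# A small, deterministic checker: same input, same verdict, every run.
# Calling this beats reinterpreting a paragraph of formatting instructions.
import re
import sys
from pathlib import Path

PATTERN = re.compile(r"^## \d{4}-\d{2}-\d{2} ")

bad = [i for i, line in enumerate(Path("CHANGELOG.md").read_text().splitlines(), 1)
       if line.startswith("## ") and not PATTERN.match(line)]
if bad:
    print(f"entries missing dates on lines: {bad}")
    sys.exit(1)
```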
Auto-Improving Context
The next layer is self-improvement loops.
A simple version:
- inspect previous agent sessions
- find repeated failures
- propose updates to CLAUDE.md, AGENTS.md, or SKILL.md
- human reviews the changes
- future runs improve
This is where agent systems start to feel like they are learning operationally, not because the model weights changed, but because the surrounding context and tooling improved.
Karpathy’s auto-research framing points in the same direction: define an objective, run experiments, update the loop by evidence.
The harness improves even when the model stays the same.
Always-on personal agents push this further. A background agent can run scheduled research, inspect previous conversations, draft memory updates, and improve its own operating context. I think of this as memory dreaming: the agent reviews what happened, proposes what should be remembered, and the human accepts or rejects it.
The human review part matters. Autonomous memory without curation becomes instruction garbage collection in reverse.
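A toy version of that dreaming pass, assuming run logs with ERROR: lines and a proposals file as the review gate; the log format and file names are assumptions.

```python
from collections import Counter
from pathlib import Path

# Scan previous run logs for repeated failures and draft context updates
# for a human to accept or reject. Nothing lands in CLAUDE.md automatically.

failures = Counter()
for log in Path("runs").glob("run-*.log"):
    for line in log.read_text().splitlines():
        if line.startswith("ERROR:"):
            failures[line] += 1

proposals = [f"- Add to CLAUDE.md: avoid this repeated failure -> {msg!r} ({n} runs)"
             for msg, n in failures.items() if n >= 3]
Path("PROPOSED_CONTEXT_UPDATES.md").write_text("\n".join(proposals) + "\n")
```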
Why Dismissing Tools Too Early Is Dangerous
One point my mentor drew out in the conversation is important: it is easy to dismiss new AI tools as fads, slop, or unnecessary.
Sometimes that dismissal is correct.
But if the dismissal comes from a shallow gloss rather than a serious trial, you may miss the genuinely useful part of the tool.
The better posture is empirical:
- try it seriously
- inspect what it changes
- find the failure modes
- keep the useful primitive
- discard the hype
Keeping up with the agentic toolchain is almost a full-time job. But if you build software for a living, the primitives are becoming too important to ignore.
Also, because the conversation got informal by the end: yes, this includes the important cultural work of appreciating RAG Against the Machine and Seedhe Maut. Some primitives are technical. Some are motivational infrastructure.
What Engineers Are Actually Engineering Now
The engineering target is shifting.
We are not only engineering application code. We are engineering:
- workflows
- flowcharts
- harnesses
- permissions
- control loops
- memory systems
- review gates
- background jobs
- agent-to-agent handoffs
- context maintenance
That is the real frontier of agentic design.
The frontier is no longer only the model frontier.
It is the design frontier:
- can the loop preserve context?
- can it protect itself?
- can it recover?
- can it improve across runs?
- can it produce useful artifacts without drowning the human in slop?
The prompt matters, but only inside that architecture.