What Changed When Agent Design Stopped Being Mostly Prompt Engineering

Most agent demos still get explained as if the interesting part is the prompt. That was closer to true when the models were weak, the output windows were short, and the agent usually fell over before the task got complicated.

That framing breaks once the models become good enough to stay in the loop for a while. Then the bottleneck moves outward. The thing that matters is not just what text you send to the model. It is the system around the model: the harness, the memory boundary, the control loop, the stop condition, the security model, and the interfaces you expose.

That is what I meant in my talk, “Frontiers in Agentic Design”. The short version is this: a lot of what gets attributed to model quality is actually context quality and harness quality. Once that clicked for me, a lot of the current landscape started making more sense.

Let’s get into it.

Everyday Cutting Edge Changes In AI

This was the cutting edge as of early April 2026.

The Shift Happened Quietly

A year ago, most people were still trying out agents in the default box they came in.

  • shorter output limits
  • weaker long-horizon behavior
  • more prompt babysitting
  • less trust in autonomous execution

That world trained people to think the main lever was phrasing. If the run was bad, maybe the prompt needed more detail. Maybe the system message needed more rules. Maybe the examples needed to be rearranged.

That still matters, but less than people think.

The second half of 2025 changed the baseline. Models got trained and RLed for longer-horizon work. The benchmarks followed. The frontier models became solid enough that it stopped being useful to talk about them as autocomplete with better branding.

Now the question is different.

If I hand an agent a real workflow, what makes it succeed or fail over hours, not turns?

That is a systems question.

The Prompt Is Inside a Larger Machine

The old mental model was simple: user writes prompt, model writes answer.

The current mental model looks more like this:

  1. gather context
  2. choose tools
  3. define memory surface
  4. define permissions
  5. define loop policy
  6. define success and stop conditions
  7. collect outputs
  8. feed useful results back into the next run

That one shift explains most of what people are now building around agents.

When you type a prompt into Claude Code or any similar coding agent, the prompt is only one part of the actual runtime. The rest is made of files, tools, commands, memory, hooks, workflow templates, and guardrails.

That framing matters. If you want to understand where the real moat is forming, do not just inspect the model. Inspect the runtime.

The Smallest Useful Agent Loop Is Already a Design Pattern

One of the slide examples in the talk was deliberately stupid:

while :; do cat PROMPT.md | claude; done

It looks like a joke. It is not.

That tiny loop already contains the start of agentic design:

  • a persistent task description in PROMPT.md
  • a repeated execution loop
  • a runtime that can keep working while you sleep
  • an implicit assumption that work continues until a stop condition is reached

Once you take that seriously, the next engineering questions appear immediately:

  • What updates PROMPT.md?
  • What counts as completion?
  • What state persists across iterations?
  • What gets pruned from context?
  • What is safe to execute automatically?
  • How do you wake up to something useful instead of six thousand tokens of drift?

The loop is not the hard part. The design around the loop is the hard part.
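
To make that concrete, here is a sketch of the same loop with two of those questions answered: an explicit stop condition and a pruning step. The `agent` function is a stub standing in for the real model call, and the `STOP` file convention is invented for this example.

```shell
#!/bin/sh
# Sketch: the toy loop with stop conditions and pruning made explicit.
# `agent` is a stub in place of a real model CLI; the STOP flag and the
# iteration cap are the design decisions that matter here.
agent() { echo "worked on: $(head -n 1 PROMPT.md)"; }

echo "task: triage open issues" > PROMPT.md
rm -f STOP run.log

i=0
while [ ! -f STOP ] && [ "$i" -lt 3 ]; do  # stop: file flag or budget
  agent >> run.log                         # one iteration of work
  tail -n 20 run.log > summary.txt         # prune: carry bounded state forward
  i=$((i + 1))
done
echo "completed $i iterations"
```

The cap and the flag file are crude, but they are already answers to "what counts as completion" and "how do you avoid waking up to drift."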

What the Harness Actually Includes

When I say harness, I mean the full environment and control surface around the model.

At minimum, that usually includes:

  • filesystem + terminal
  • tool calling surface
  • security boundaries
  • memory system
  • workflow templates
  • feedback loops
  • scheduling and background execution

That list explains a lot of current agent systems.

Filesystem and terminal

Most serious coding agents operate in a filesystem and terminal because that is where useful work already lives.

  • source tree
  • config files
  • tests
  • logs
  • git state
  • package managers
  • shells

Models are increasingly trained around this environment. That is why a plain terminal plus files can still beat a lot of more “specialized” agent UIs. The environment is close to the work.

Security

Security is usually not solved by asking the model to be careful.

It is solved in the harness or outside it.

  • permission prompts
  • sandboxing
  • restricted tool surfaces
  • network boundaries
  • review checkpoints
  • policy enforced by the runtime, not by vibes

That one decision does a lot of work. If your security model depends on the assistant remembering to behave, you do not have a security model.
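
As a toy illustration of policy living in the runtime rather than the prompt, here is a command allowlist. The list itself is invented for the example; real harnesses layer this with sandboxing, network boundaries, and review checkpoints.

```shell
#!/bin/sh
# Sketch: the harness decides what runs, not the model. The allowlist
# below is illustrative only; a real policy would also scope paths,
# network access, and argument patterns.
is_allowed() {
  case "${1%% *}" in
    ls|cat|git|pytest) return 0 ;;   # read-only or reviewed tools
    *)                 return 1 ;;   # everything else needs a human
  esac
}

for cmd in "git status" "rm -rf /"; do
  if is_allowed "$cmd"; then
    echo "allow: $cmd"
  else
    echo "block: $cmd"
  fi
done
```

The point is that the decision happens before execution, in code the model cannot edit mid-run.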

Skills, MCPs, hooks, and tool surfaces

This is where a lot of practical leverage comes from.

You can expose reusable procedures through:

  • SKILL.md style instructions
  • MCP servers
  • hooks
  • slash command workflows
  • plain custom tools

These are all variations on the same idea: move repeatable structure out of the prompt and into the runtime.

That matters because prompts are expensive places to store workflows. They are transient, easy to bloat, and easy to forget to update. A skill, tool, or workflow definition is a more stable surface.

Memory systems

Memory systems used to get discussed mostly as RAG engineering.

That was a whole phase:

  • chunking
  • reranking
  • embeddings
  • metadata filters
  • vector stores
  • knowledge graphs
  • caching layers

That work still matters. But in agents, memory is broader than retrieval.

The real question is: what does the system get to remember, in what form, and at what cost?

Possible answers include:

  • flat files
  • hierarchical notes
  • vector memory
  • knowledge bases
  • knowledge graphs
  • summaries of prior runs
  • durable user profiles

Karpathy’s LLMWiki is one useful reference point here. It makes the memory surface explicit instead of pretending the context window is enough.
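
The cheapest of those answers is worth seeing in code. Below is a flat-file sketch; `MEMORY.md` is a hypothetical filename for this example, not a convention from any particular tool.

```shell
#!/bin/sh
# Sketch: the simplest durable memory surface is a flat file that each
# run appends to and the next run reads back in. MEMORY.md is a
# hypothetical name chosen for illustration.
MEM="MEMORY.md"
: > "$MEM"                                 # start fresh for the demo

remember() { printf -- "- %s\n" "$1" >> "$MEM"; }

remember "run 1: tests fail unless POSTGRES_URL is set"
remember "run 2: run make lint before committing"

# The next run's context is the task plus everything remembered so far.
printf "Task: fix CI\n\nKnown facts:\n%s\n" "$(cat "$MEM")"
```

Everything fancier, from vector memory to knowledge graphs, is a trade against this baseline: more recall precision for more machinery and cost.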

Workflow prompts as runtime primitives

Claude Code slash commands are a simple example of this pattern.

Things like planning mode, gstack, or any staged command workflow are basically templated runtime behaviors. They ask questions, gather missing inputs, sequence steps, and apply prompting at each stage.

That is a workflow engine wearing a prompt costume.

It is less deterministic than a hardcoded state machine, but it is the same architectural move: give the model a reusable procedure instead of a one-shot instruction blob.
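
A minimal sketch of that move, with `ask_model` as a stub standing in for one model call per stage; the sequencing, not the stub, is the point.

```shell
#!/bin/sh
# Sketch: a staged "command" is sequenced prompting. Each stage would
# normally be a real model call with its own template; here the calls
# are stubbed so the structure is visible.
ask_model() { echo "[model answer to: $1]"; }

plan=$(ask_model "Draft a plan for: add rate limiting")
critique=$(ask_model "Critique this plan: $plan")
final=$(ask_model "Revise the plan using this critique: $critique")

echo "$final"
```

Each stage feeds the previous stage's output forward, which is exactly what a plan-then-critique-then-revise slash command does under the hood.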

The More Interesting Patterns Are Emerging One Layer Up

Once you start looking at agents as runtimes, some recurring patterns show up.

1. Extensible minimal agents

Systems like Pi are interesting because they are minimal in the right place.

They do not try to arrive as complete operating systems for intelligence. They give you a thin base and let you prompt or script your workflow into existence.

That makes them feel a bit like the Neovim of agents.

Not because everyone should use them, but because extensibility is the product.

2. Personal agents with schedules and identity files

Systems like OpenClaw pushed another pattern forward:

  • scheduled execution
  • personal context files like SOUL.md
  • less waiting for user confirmation on every step
  • more continuity across runs

That setup changes the relationship completely. You are no longer just calling a tool. You are maintaining an operator loop with memory, schedule, and preferences.

3. Self-improving or self-editing agents

This is one of the more important frontiers.

A self-editing agent does not just act inside the environment. It updates the way it acts.

That can happen through:

  • learning from prior chats
  • extracting insights into CLAUDE.md or AGENTS.md
  • running evals and feeding failures back into the system prompt or workflow docs
  • adjusting tools, hooks, or memory policies based on observed failures

That is a real architectural shift.

The runtime starts becoming a thing that can tune its own future behavior.

Karpathy’s autoresearch points in the same direction from another angle. Run experiments. Maximize a success function. Timebox the attempt. Improve by hill climbing instead of waiting for manual intervention every turn.
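
A minimal sketch of that shape, with a fake success function standing in for a real eval harness:

```shell
#!/bin/sh
# Sketch of timeboxed hill climbing: try variants, keep the best score
# seen so far, stop after a fixed budget. `score_variant` is a fake
# success function for illustration; in practice it would run evals.
score_variant() { echo $(( $1 * 7 % 10 )); }

best=-1
best_variant=""
for v in 1 2 3 4 5; do            # timebox: five attempts, then stop
  s=$(score_variant "$v")
  if [ "$s" -gt "$best" ]; then   # keep only improvements
    best=$s
    best_variant=$v
  fi
done
echo "best variant: $best_variant (score $best)"
```

Swap the stub for a real eval suite and the loop becomes an agent tuning its own workflow docs or prompts between runs.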

4. Swarms and multi-agent coordination

A lot of current multi-agent systems fall into two patterns.

Divide and conquer

Break the work into parallel tasks that do not interfere with each other.

This is the clean pattern.

  • plan well
  • split by boundary
  • keep outputs separate
  • merge later
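
Under the assumption that the subtasks really are independent, this pattern is almost just job control. Each "agent" below is a stub; a real system would launch model processes with disjoint scopes.

```shell
#!/bin/sh
# Sketch: divide and conquer with plain shell job control. Each worker
# writes to its own file (split by boundary, keep outputs separate),
# and merging happens only after all of them finish.
work() { echo "result from $1" > "out_$1.txt"; }

for part in frontend backend docs; do
  work "$part" &                  # run subtasks in parallel
done
wait                              # synchronize before merging

cat out_frontend.txt out_backend.txt out_docs.txt > merged.txt
cat merged.txt
```

The discipline is all in the setup: if the split was clean, the merge is trivial; if it was not, no amount of coordination machinery saves you.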

Active coordination

Use a leader agent and multiple specialists.

  • frontend agent
  • security agent
  • code quality agent
  • infra agent
  • reviewer agent

This is the messier pattern, but sometimes the right one. Systems like Paperclip, Mission Control, Symphony, and other swarm-style setups all explore different versions of this idea.

The hard part is not spinning up more agents. The hard part is deciding what they are allowed to share, when they synchronize, and who gets to declare success.

This Is Why So Much “Model Quality” Feels Misattributed

When people say one agent product feels much better than another, the explanation is often not just the base model.

It is often some combination of:

  • better context packing
  • better file selection
  • better workflow decomposition
  • better memory retrieval
  • better stop conditions
  • better retry behavior
  • better tool ergonomics
  • better guardrails

In other words, better systems design.

The model matters. Of course it does.

But a strong model inside a weak harness still wastes tokens, loses context, repeats work, and fails in boring ways. A strong harness around a strong model is where the step change starts to show up.

A lot of what gets attributed to “intelligence” is just good runtime engineering.

What I Am Actually Engineering Now

This is the practical conclusion I keep coming back to.

I am not trying to engineer prompts in isolation. I am engineering systems.

That means:

  • thinking in flowcharts
  • designing background jobs and maintenance loops
  • applying a DevOps mindset to agent runtimes
  • automating away toil
  • reducing operator unhappiness
  • evolving workflows instead of repeatedly restating instructions

Most of the leverage is in context. Some of the leverage is in the harness. Both matter more than people admit.

That is also why the old question, “what prompt are you using?” increasingly feels too small.

The better questions are:

  • What environment does the model operate in?
  • What memory does it have?
  • What tools can it call?
  • What happens after each step?
  • What gets persisted?
  • What gets reviewed?
  • What improves the next run?

Those are systems questions. That is where the real moat is beginning to form.

The Rule I Would Use Again

If I had to compress the whole talk into one line, it would be this:

As models get better, agent design becomes less about the prompt and more about the runtime around it.

That does not make prompts irrelevant. It just puts them in the right box.

The frontier is no longer only model capability. It is the architecture of the loop you build around the model, and how well that loop can remember, decide, recover, and improve.