Most early conversation about agents treated the model prompt as the whole product. That lens was valid when prompts and context windows were the primary constraint.
The frontier shifted once models became reliable enough for long-horizon workflows. Visible quality now comes less from phrasing and more from the architecture around it.
I now describe that as agentic design:
- prompt quality still matters, but not enough by itself
- harness quality tends to dominate real outcomes
- runtime design determines failure recovery, trust, and repeatability
The Shift That Quietly Happened
A year ago, workflows were still dominated by:
- prompt experiments
- output window pressure
- turn-by-turn babysitting
- uncertainty around autonomous execution
Now that model quality has improved, the interesting questions have changed:
- what context enters each run?
- how are tools selected and bounded?
- what should persist across iterations?
- when does a run stop?
That framing is why a lot of modern agent systems look like application stacks, not chat wrappers.
The real shift happened quietly. I noticed it in the second half of 2025: we moved from “short-window prompt behavior” to “long-running runtime behavior.” Benchmarks improved, and models that once failed quietly on longer tasks now held state longer. The bottleneck moved outward.
So when people ask “which frontier model should I use?”, I increasingly answer: first decide the harness.
It is also common to see “model quality” credited for wins where most of the gain actually comes from process quality: better guardrails, cleaner stop conditions, and steadier context flow. A little runtime architecture often looks like a lot more capability.
The architecture matters, and the model has to run inside it.
Prompt as a Subsystem
The old mental model was elegant:
    user prompt -> model answer
The practical model is this:
- collect context
- call tools
- apply permissions and policies
- run loop logic
- capture outputs
- decide continue/retry
- persist useful state
In other words, the prompt is now one input to a larger machine.
If you want to understand “better agents,” inspect the machine.
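Here is the machine at its smallest, as a Python sketch. Every name in it is illustrative rather than any real framework’s API, and it assumes the same PROMPT.md the loop in the next section reads; the point is only where context, policy, output capture, and the stop decision sit relative to the prompt.

    from dataclasses import dataclass, field

    @dataclass
    class State:
        notes: list = field(default_factory=list)   # persists across iterations
        done: bool = False

    ALLOWED_TOOLS = {"read_file"}                    # policy lives in the runtime

    def read_file(path: str) -> str:
        with open(path) as f:
            return f.read()

    TOOLS = {"read_file": read_file}

    def fake_model(prompt: str) -> dict:
        # Stand-in for a real model call: proposes one tool call, then an answer.
        if "PROMPT.md" not in prompt:
            return {"tool": "read_file", "args": {"path": "PROMPT.md"}}
        return {"answer": "done"}

    def run_once(state: State) -> State:
        prompt = "\n".join(state.notes) or "start"          # collect context
        action = fake_model(prompt)                         # the prompt is one input
        if "tool" in action:
            if action["tool"] not in ALLOWED_TOOLS:         # apply permissions
                state.notes.append("rejected: " + action["tool"])
            else:
                result = TOOLS[action["tool"]](**action["args"])
                state.notes.append("PROMPT.md: " + result)  # capture outputs
        else:
            state.done = True                               # decide continue/stop
        return state

Loop run_once until done flips, and every section below is an upgrade to one of those comments.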
The Small Loop Pattern
A tiny loop captures this clearly:
    while :; do cat PROMPT.md | claude; done
I used this as a slide example because it is deceptively simple. A persistent prompt and repeated execution already make a real workflow; the stop condition is the first thing you have to add.
Then the engineering questions explode:
- how do you refresh task framing?
- what is completion?
- what context should be pruned?
- what is safe for the model to execute?
- how do you detect drift and wake to useful work?
The loop is trivial. The surrounding runtime is the system.
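To show where those answers start to live, here is the same loop with just two of them bolted on: an iteration cap and a completion marker. Both choices are illustrative, not a standard, and the `claude` CLI is assumed on PATH exactly as in the one-liner.

    import subprocess

    MAX_ITERATIONS = 20                        # crude guard against endless drift

    def run_agent() -> str:
        with open("PROMPT.md") as f:
            prompt = f.read()
        # The equivalent of `cat PROMPT.md | claude`, with the output captured.
        result = subprocess.run(["claude"], input=prompt,
                                capture_output=True, text=True)
        return result.stdout

    for _ in range(MAX_ITERATIONS):
        output = run_agent()
        if "TASK COMPLETE" in output:          # completion must be defined somewhere
            break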
What a Useful Harness Includes
I treat the harness as the minimum production surface:
- filesystem + terminal
- tool calling and policy
- memory boundaries
- workflow templates
- feedback, retry, backoff
- scheduled/background execution
- security guardrails
That combination is what I see in systems that feel stable under load.
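One item on that list is small enough to sketch outright: feedback, retry, backoff. A generic Python version, not tied to any agent framework:

    import random
    import time

    def with_retries(step, max_attempts=4, base_delay=1.0):
        """Run `step` (a zero-argument callable), retrying failures with
        exponential backoff plus jitter. Raises after the final attempt."""
        for attempt in range(max_attempts):
            try:
                return step()
            except Exception:
                if attempt == max_attempts - 1:
                    raise                           # surface failure, never swallow it
                delay = base_delay * 2 ** attempt + random.random()
                time.sleep(delay)                   # back off before the next try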
Filesystem and terminal
Agents tend to be strongest where the interface is concrete and real: source, tests, logs, git, shells. If the runtime can already operate in this environment, you often avoid inventing new abstractions.
Security by environment
Safety is not “tell the model to be careful.” It is enforced boundaries:
- restricted tools
- sandboxing
- network boundaries
- checkpoints and review gates
- permission policies in runtime, not only prompt text
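That last item is the one most often left as prose in a system prompt, so here is a minimal sketch of what “in runtime” means. The allowlist contents are invented for illustration; the shape is the point: deny by default, enforce before execution.

    ALLOWED = {
        "read_file": {"max_calls": 100},
        "run_tests": {"max_calls": 20},
        # no "shell" entry: unrestricted shell is denied by construction
    }

    def permit(tool_name: str, call_counts: dict) -> bool:
        """Deny-by-default check the runtime applies before any tool executes."""
        rule = ALLOWED.get(tool_name)
        if rule is None:
            return False                             # not on the allowlist
        return call_counts.get(tool_name, 0) < rule["max_calls"]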
Tools, skills, and prompts as code
This is where leverage compounds. Reusable behavior should live in stable surfaces:
- SKILL.md conventions
- MCP servers
- hooks
- slash commands
- custom tools
These are “runtime procedures,” replacing ad-hoc one-shot prompt blobs.
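Mechanically, a runtime procedure is behavior registered once in a stable surface and reused on every run. A generic registry sketch; SKILL.md files, MCP servers, and slash commands are richer versions of the same move.

    TOOLS = {}

    def tool(name: str):
        """Register a function as a named, reusable runtime procedure."""
        def register(fn):
            TOOLS[name] = fn
            return fn
        return register

    @tool("summarize_diff")
    def summarize_diff(diff_text: str) -> str:
        # Placeholder body; a real version would call the model with a stable,
        # versioned prompt rather than an ad-hoc one-shot blob.
        return diff_text[:500]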
Memory systems are more than retrieval
For a long while we treated memory as embeddings + retrieval. That is only half the story.
You also need to define:
- what should survive across sessions
- what is authoritative
- how conflicting facts are merged
- what cost each retention strategy has
Flat files, hierarchical notes, vector stores, and user profile documents all become memory depending on durability and conflict model.
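A sketch of what a conflict model can mean in practice: each remembered fact carries durability metadata, and merging is an explicit rule instead of whatever retrieval happens to surface. The schema and the merge rule here are illustrative choices, not a standard.

    from dataclasses import dataclass

    @dataclass
    class Fact:
        key: str
        value: str
        authoritative: bool    # e.g. stated by the user, vs. inferred by the model
        timestamp: float

    def merge(existing: Fact, incoming: Fact) -> Fact:
        """Resolve two versions of the same fact with an explicit rule."""
        if existing.authoritative and not incoming.authoritative:
            return existing                 # inferred facts never override stated ones
        if incoming.timestamp >= existing.timestamp:
            return incoming                 # otherwise the newest version wins
        return existing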
Patterns Emerging at Scale
I now see four practical directions.
1) Extensible minimal agents
Minimal shells like Pi are attractive because they are not monoliths. You get a thin kernel and can script your operating model. It is close to the “Neovim” argument: the right default is tiny, the power is in extension and composition.
2) Scheduled personal operators
OpenClaw-like designs introduced strong continuity: scheduled runs, persistent identity files (SOUL.md-style), and reduced confirmation in routine work. The result is less chat and more stable daily operation.
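In miniature, that continuity is a schedule plus an identity file re-read on every run. A sketch with an invented interval and routine; the SOUL.md name is borrowed from the convention above.

    import time

    INTERVAL_SECONDS = 24 * 60 * 60            # one scheduled run per day

    def run_routine(identity: str) -> None:
        # Placeholder for the routine work (triage, status notes, cleanup, ...).
        print("running as:", identity[:40])

    while True:
        with open("SOUL.md") as f:             # persistent identity, re-read each run
            identity = f.read()
        run_routine(identity)                  # routine work, no confirmation step
        time.sleep(INTERVAL_SECONDS)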
3) Self-improving loops
The shift from “agent as worker” to “agent as tuner” is real.
I now see systems updating CLAUDE.md/AGENTS.md, feeding failures into local eval loops, and adjusting tooling choices or retry policy over time. Karpathy’s autoresearch framing is the same idea: define objective, run experiments, improve by evidence, not vibes.
In practice this is where engineering debt disappears from later steps: if the agent system can improve its own loop, you reduce repeated manual fixups and make each run slightly more reliable than the one before.
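The smallest honest version of “agent as tuner”: each run’s outcome updates a config that the next run reads. The file name, field, and adjustment rule are all invented for illustration; real systems do the same against CLAUDE.md/AGENTS.md and local evals.

    import json
    import os

    CONFIG_PATH = "loop_config.json"

    def load_config() -> dict:
        if os.path.exists(CONFIG_PATH):
            with open(CONFIG_PATH) as f:
                return json.load(f)
        return {"max_attempts": 2}             # default before any evidence exists

    def record_run(failed: bool) -> None:
        """Feed one run's outcome back into the next run's settings."""
        cfg = load_config()
        if failed:
            cfg["max_attempts"] = min(cfg["max_attempts"] + 1, 6)  # evidence, not vibes
        with open(CONFIG_PATH, "w") as f:
            json.dump(cfg, f)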
4) Multi-agent topologies
Swarms are appearing in two forms:
- divide-and-conquer (clear boundaries, independent outputs, deterministic merge)
- active coordination (leader + specialists, explicit synchronization points, controlled handoff)
The hard part is not spawning agents. It is setting ownership, handoff policy, and convergence conditions.
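Of the two forms, divide-and-conquer is the easier one to make deterministic: each worker owns a disjoint input, shares no state, and the merge is a fixed rule, so reruns converge. In this sketch, `run_worker` is a stand-in for a real sub-agent call scoped to one file.

    def run_worker(file_path: str) -> dict:
        # Stand-in for a sub-agent run scoped to exactly one file (clear boundary).
        return {"file": file_path, "summary": "reviewed " + file_path}

    def review(files: list) -> list:
        results = [run_worker(f) for f in files]          # independent outputs
        return sorted(results, key=lambda r: r["file"])   # deterministic merge

    print(review(["handlers.py", "auth.py"]))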
Why “Model Quality” Is Often Misread
When one agentic product feels better than another, people often over-credit the underlying model.
Operationally the difference is often runtime:
- better context packing
- better file and tool selection
- stronger step decomposition
- sharper stop logic and retries
- cleaner operator ergonomics
- guardrails that prevent silent failure
A good model is necessary. A weak harness still wastes that capability in repetitive drift.
What I Am Building
For me, this has become a systems-first problem:
- flowcharts over slogans
- background jobs over brittle manual loops
- explicit review gates over implicit trust
- durable context strategies over ad-hoc prompt memory
A quick sanity check I use now is less about model choice and more about control questions:
- what environment is the model operating in?
- what memory can it read and what can it write?
- what tool is allowed where?
- what is the stop condition and who declares success?
- what gets persisted for the next run?
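Those questions compress into a small record; if any field is hard to fill in for a given system, the gap is in the harness, not the model. The field names are illustrative.

    from dataclasses import dataclass

    @dataclass
    class RunSpec:
        environment: str          # what the model operates in (repo, sandbox, ...)
        readable_memory: list     # what it can read
        writable_memory: list     # what it can write
        allowed_tools: dict       # which tool is allowed where
        stop_condition: str       # what ends the run, and who declares success
        persisted_outputs: list   # what survives into the next run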
The leverage is in harness and loop quality: what the system chooses to remember, run, reject, and retry.
The Rule I Keep Using
The frontier is no longer only a model frontier.
It is also a design frontier:
- can the loop preserve context?
- can it protect itself?
- can it recover?
- can it improve across runs?
The prompt matters, but only inside that architecture.