Most early conversation about agents treated the model prompt as the whole product. That lens was valid when prompts and context windows were the primary constraint.
The frontier shifted once models became reliable enough for long-horizon workflows. Visible quality now comes less from phrasing and more from the architecture around it.
I now describe that as agentic design:
- prompt quality still matters, but not enough by itself
- harness quality tends to dominate real outcomes
- runtime design determines failure recovery, trust, and repeatability
The Shift That Quietly Happened
A year ago, workflows were still dominated by:
- prompt experiments
- output window pressure
- turn-by-turn babysitting
- uncertainty around autonomous execution
Now that model quality has improved, the interesting questions have changed:
- what context enters each run?
- how are tools selected and bounded?
- what should persist across iterations?
- when does a run stop?
That framing is why a lot of modern agent systems look like application stacks, not chat wrappers.
The real shift happened quietly. I noticed it in the second half of 2025: we moved from “short-window prompt behavior” to “long-running runtime behavior.” Benchmarks improved, and models that once failed quietly on longer tasks now held state longer. The bottleneck moved outward.
So when people ask “which frontier model should I use?”, I increasingly answer: first decide the harness.
It is also common to see “model quality” credited for wins where most of the gain actually comes from process quality: better guardrails, cleaner stop conditions, and steadier context flow. A little runtime architecture often looks like a lot more capability.
The architecture matters, and the model has to run inside it.
Prompt as a Subsystem
The old mental model was elegant:
    user prompt -> model answer
The practical model is this:
- collect context
- call tools
- apply permissions and policies
- run loop logic
- capture outputs
- decide continue/retry
- persist useful state
In other words, the prompt is now one input to a larger machine.
If you want to understand “better agents,” inspect the machine.
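Here is the machine at its smallest, as a Python sketch. Every name in it is illustrative rather than any real framework’s API, and it assumes the same PROMPT.md the loop in the next section reads; the point is only where context, policy, output capture, and the stop decision sit relative to the prompt.

    from dataclasses import dataclass, field

    @dataclass
    class State:
        notes: list = field(default_factory=list)   # persists across iterations
        done: bool = False

    ALLOWED_TOOLS = {"read_file"}                    # policy lives in the runtime

    def read_file(path: str) -> str:
        with open(path) as f:
            return f.read()

    TOOLS = {"read_file": read_file}

    def fake_model(prompt: str) -> dict:
        # Stand-in for a real model call: proposes one tool call, then an answer.
        if "PROMPT.md" not in prompt:
            return {"tool": "read_file", "args": {"path": "PROMPT.md"}}
        return {"answer": "done"}

    def run_once(state: State) -> State:
        prompt = "\n".join(state.notes) or "start"          # collect context
        action = fake_model(prompt)                         # the prompt is one input
        if "tool" in action:
            if action["tool"] not in ALLOWED_TOOLS:         # apply permissions
                state.notes.append("rejected: " + action["tool"])
            else:
                result = TOOLS[action["tool"]](**action["args"])
                state.notes.append("PROMPT.md: " + result)  # capture outputs
        else:
            state.done = True                               # decide continue/stop
        return state

Loop run_once until done flips, and every section below is an upgrade to one of those comments.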
The Small Loop Pattern
A tiny loop captures this clearly:
    while :; do cat PROMPT.md | claude; done
I used this as a slide example because it is deceptively simple. A persistent prompt and repeated execution already make a real workflow; the stop condition is the first thing you have to add.
Then the engineering questions explode:
- how do you refresh task framing?
- what is completion?
- what context should be pruned?
- what is safe for the model to execute?
- how do you detect drift and wake to useful work?
The loop is trivial. The surrounding runtime is the system.
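To show where those answers start to live, here is the same loop with just two of them bolted on: an iteration cap and a completion marker. Both choices are illustrative, not a standard, and the `claude` CLI is assumed on PATH exactly as in the one-liner.

    import subprocess

    MAX_ITERATIONS = 20                        # crude guard against endless drift

    def run_agent() -> str:
        with open("PROMPT.md") as f:
            prompt = f.read()
        # The equivalent of `cat PROMPT.md | claude`, with the output captured.
        result = subprocess.run(["claude"], input=prompt,
                                capture_output=True, text=True)
        return result.stdout

    for _ in range(MAX_ITERATIONS):
        output = run_agent()
        if "TASK COMPLETE" in output:          # completion must be defined somewhere
            break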
What a Useful Harness Includes
I treat the harness as the minimum production surface:
- filesystem + terminal
- tool calling and policy
- memory boundaries
- workflow templates
- feedback, retry, backoff
- scheduled/background execution
- security guardrails
That combination is what I see in systems that feel stable under load.
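One item on that list is small enough to sketch outright: feedback, retry, backoff. A generic Python version, not tied to any agent framework:

    import random
    import time

    def with_retries(step, max_attempts=4, base_delay=1.0):
        """Run `step` (a zero-argument callable), retrying failures with
        exponential backoff plus jitter. Raises after the final attempt."""
        for attempt in range(max_attempts):
            try:
                return step()
            except Exception:
                if attempt == max_attempts - 1:
                    raise                           # surface failure, never swallow it
                delay = base_delay * 2 ** attempt + random.random()
                time.sleep(delay)                   # back off before the next try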
Filesystem and terminal
Agents tend to be strongest where the interface is concrete and real: source, tests, logs, git, shells. If the runtime can already operate in this environment, you often avoid inventing new abstractions.
Security by environment
Safety is not “tell the model to be careful.” It is enforced boundaries:
- restricted tools
- sandboxing
- network boundaries
- checkpoints and review gates
- permission policies in runtime, not only prompt text
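That last item is the one most often left as prose in a system prompt, so here is a minimal sketch of what “in runtime” means. The allowlist contents are invented for illustration; the shape is the point: deny by default, enforce before execution.

    ALLOWED = {
        "read_file": {"max_calls": 100},
        "run_tests": {"max_calls": 20},
        # no "shell" entry: unrestricted shell is denied by construction
    }

    def permit(tool_name: str, call_counts: dict) -> bool:
        """Deny-by-default check the runtime applies before any tool executes."""
        rule = ALLOWED.get(tool_name)
        if rule is None:
            return False                             # not on the allowlist
        return call_counts.get(tool_name, 0) < rule["max_calls"]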
Tools, skills, and prompts as code
This is where leverage compounds. Reusable behavior should live in stable surfaces:
- SKILL.md conventions
- MCP servers
- hooks
- slash commands
- custom tools
These are “runtime procedures,” replacing ad-hoc one-shot prompt blobs.
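Mechanically, a runtime procedure is behavior registered once in a stable surface and reused on every run. A generic registry sketch; SKILL.md files, MCP servers, and slash commands are richer versions of the same move.

    TOOLS = {}

    def tool(name: str):
        """Register a function as a named, reusable runtime procedure."""
        def register(fn):
            TOOLS[name] = fn
            return fn
        return register

    @tool("summarize_diff")
    def summarize_diff(diff_text: str) -> str:
        # Placeholder body; a real version would call the model with a stable,
        # versioned prompt rather than an ad-hoc one-shot blob.
        return diff_text[:500]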
Memory systems are more than retrieval
For a long while we treated memory as embeddings + retrieval. That is only half the story.
You also need to define:
- what should survive across sessions
- what is authoritative
- how conflicting facts are merged
- what cost each retention strategy has
Flat files, hierarchical notes, vector stores, and user profile documents all become memory depending on durability and conflict model.
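A sketch of what a conflict model can mean in practice: each remembered fact carries durability metadata, and merging is an explicit rule instead of whatever retrieval happens to surface. The schema and the merge rule here are illustrative choices, not a standard.

    from dataclasses import dataclass

    @dataclass
    class Fact:
        key: str
        value: str
        authoritative: bool    # e.g. stated by the user, vs. inferred by the model
        timestamp: float

    def merge(existing: Fact, incoming: Fact) -> Fact:
        """Resolve two versions of the same fact with an explicit rule."""
        if existing.authoritative and not incoming.authoritative:
            return existing                 # inferred facts never override stated ones
        if incoming.timestamp >= existing.timestamp:
            return incoming                 # otherwise the newest version wins
        return existing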
Patterns Emerging at Scale
I now see four practical directions.
1) Extensible minimal agents
Minimal shells like Pi are attractive because they are not monoliths. You get a thin kernel and can script your operating model. It is close to the “Neovim” argument: the right default is tiny, the power is in extension and composition.
2) Scheduled personal operators
OpenClaw-like designs introduced strong continuity: scheduled runs, persistent identity files (SOUL.md-style), and reduced confirmation in routine work. The result is less chat and more stable daily operation.
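In miniature, that continuity is a schedule plus an identity file re-read on every run. A sketch with an invented interval and routine; the SOUL.md name is borrowed from the convention above.

    import time

    INTERVAL_SECONDS = 24 * 60 * 60            # one scheduled run per day

    def run_routine(identity: str) -> None:
        # Placeholder for the routine work (triage, status notes, cleanup, ...).
        print("running as:", identity[:40])

    while True:
        with open("SOUL.md") as f:             # persistent identity, re-read each run
            identity = f.read()
        run_routine(identity)                  # routine work, no confirmation step
        time.sleep(INTERVAL_SECONDS)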
3) Self-improving loops
The shift from “agent as worker” to “agent as tuner” is real.
I now see systems updating CLAUDE.md/AGENTS.md, feeding failures into local eval loops, and adjusting tooling choices or retry policy over time. Karpathy’s autoresearch framing is the same idea: define objective, run experiments, improve by evidence, not vibes.
In practice this is where engineering debt disappears from later steps: if the agent system can improve its own loop, you reduce repeated manual fixups and make each run slightly more reliable than the one before.
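The smallest honest version of “agent as tuner”: each run’s outcome updates a config that the next run reads. The file name, field, and adjustment rule are all invented for illustration; real systems do the same against CLAUDE.md/AGENTS.md and local evals.

    import json
    import os

    CONFIG_PATH = "loop_config.json"

    def load_config() -> dict:
        if os.path.exists(CONFIG_PATH):
            with open(CONFIG_PATH) as f:
                return json.load(f)
        return {"max_attempts": 2}             # default before any evidence exists

    def record_run(failed: bool) -> None:
        """Feed one run's outcome back into the next run's settings."""
        cfg = load_config()
        if failed:
            cfg["max_attempts"] = min(cfg["max_attempts"] + 1, 6)  # evidence, not vibes
        with open(CONFIG_PATH, "w") as f:
            json.dump(cfg, f)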
4) Multi-agent topologies
Swarms are appearing in two forms:
- divide-and-conquer (clear boundaries, independent outputs, deterministic merge)
- active coordination (leader + specialists, explicit synchronization points, controlled handoff)
The hard part is not spawning agents. It is setting ownership, handoff policy, and convergence conditions.
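Of the two forms, divide-and-conquer is the easier one to make deterministic: each worker owns a disjoint input, shares no state, and the merge is a fixed rule, so reruns converge. In this sketch, `run_worker` is a stand-in for a real sub-agent call scoped to one file.

    def run_worker(file_path: str) -> dict:
        # Stand-in for a sub-agent run scoped to exactly one file (clear boundary).
        return {"file": file_path, "summary": "reviewed " + file_path}

    def review(files: list) -> list:
        results = [run_worker(f) for f in files]          # independent outputs
        return sorted(results, key=lambda r: r["file"])   # deterministic merge

    print(review(["handlers.py", "auth.py"]))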
Why “Model Quality” Is Often Misread
When one agentic product feels better than another, people often over-credit the underlying model.
Operationally the difference is often runtime:
- better context packing
- better file and tool selection
- stronger step decomposition
- sharper stop logic and retries
- cleaner operator ergonomics
- guardrails that prevent silent failure
A good model is necessary. A weak harness still wastes that capability in repetitive drift.
What I Am Building
For me, this has become a systems-first problem:
- flowcharts over slogans
- background jobs over brittle manual loops
- explicit review gates over implicit trust
- durable context strategies over ad-hoc prompt memory
A quick sanity check I use now is less about model choice and more about control questions:
- what environment is the model operating in?
- what memory can it read and what can it write?
- what tool is allowed where?
- what is the stop condition and who declares success?
- what gets persisted for the next run?
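Those questions compress into a small record; if any field is hard to fill in for a given system, the gap is in the harness, not the model. The field names are illustrative.

    from dataclasses import dataclass

    @dataclass
    class RunSpec:
        environment: str          # what the model operates in (repo, sandbox, ...)
        readable_memory: list     # what it can read
        writable_memory: list     # what it can write
        allowed_tools: dict       # which tool is allowed where
        stop_condition: str       # what ends the run, and who declares success
        persisted_outputs: list   # what survives into the next run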
The leverage is in harness and loop quality: what the system chooses to remember, run, reject, and retry.
The Rule I Keep Using
The frontier is no longer only a model frontier.
It is also a design frontier:
- can the loop preserve context?
- can it protect itself?
- can it recover?
- can it improve across runs?
The prompt matters, but only inside that architecture.