I have a repeatable post-feature cleanup sequence:
- remove fallback code
- remove dead code
- update docs
I wanted this workflow often, so I tried creating a Pi agent for it.
Pi itself initially suggested creating a separate slash command for each step. I did not like that: I did not want to remember and run three different slash commands every time I finished a feature.
So I tried the obvious shortcut instead: one extension command with all three cleanup steps packed into a single prompt.
It backfired.
The model handled the first parts, then quietly failed to give the last step the same attention. The run looked complete. It was not complete.
That is the annoying version of failure in agent workflows. Not a crash. Not a refusal. A plausible-looking partial success.
Why One-Shot Looked Fine
One-shot looked efficient on paper:
- one user message
- one model turn
- one cleanup command
The original shape was simple enough:
Review the codebase and complete the following:
1) Remove fallback code
2) Remove dead code
3) Update all .md files in the repo
Nothing about that prompt looks obviously broken.
That framing matters. The problem was not clarity. The problem was asking one turn to carry multiple state transitions with no explicit handoff between them.
The model could appear to do the work without ever making it legible which steps were actually completed and which ones were implicitly dropped.
What The Extension Does Now
The extension now runs three bounded tasks in sequence:
fallback → dead-code → docs
Each task gets its own model turn.
Each turn also ends with an explicit gate:
Reply with exactly one of:
- STEP_DONE - if you completed the task above
- STEP_SKIPPED - if you skipped or did nothing
Then a brief one-line reason.
That is the important change.
The runtime no longer treats cleanup as one large prompt. It treats it as a series of small controlled transitions.
Why The Gate Matters
The extension is not trying to force every step to produce edits.
It is trying to force every step to produce an outcome.
That is a better contract.
For each task, the assistant must say one of two things:
- STEP_DONE
- STEP_SKIPPED
If the reply does not contain a valid gate result, the extension retries the same task. Up to three attempts.
The Pi extension snippet that finally worked implements sequential gated prompting:
async function runTask(pi, ctx, task, attempt = 1) {
  // Wait until no turn is in flight before sending anything.
  await ctx.waitForIdle();
  // Snapshot the current branch so new entries can be detected later.
  const seenEntryIds = new Set(
    ctx.sessionManager.getBranch().map((entry) => entry.id),
  );
  // Send one bounded instruction for this task only.
  pi.sendUserMessage(buildInstruction(task, attempt));
  await waitForTurnToStart(ctx, seenEntryIds);
  await ctx.waitForIdle();
  // Read only the assistant reply produced by this turn.
  const reply = getNewAssistantReply(ctx, seenEntryIds);
  const result = parseTaskResult(reply);
  if (result === "done" || result === "skipped") {
    return result;
  }
  if (attempt >= MAX_ATTEMPTS) {
    return "failed";
  }
  // Malformed or missing gate result: retry the same task.
  return runTask(pi, ctx, task, attempt + 1);
}
So the retry loop is not “retry until the model edits files.”
It is “retry until the model gives a valid explicit outcome.”
That one decision does a lot of work.
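To make the sequence concrete, here is a hypothetical driver that runs the three tasks through a gated runner like runTask. The runner is passed in as a parameter so the sketch stays self-contained; runCleanup and its parameter names are mine, not the extension's:

```javascript
// Hypothetical driver (not the extension's actual code): run each cleanup
// task in order through a gated runner and collect its explicit outcome.
async function runCleanup(runTask, pi, ctx, tasks = ["fallback", "dead-code", "docs"]) {
  const outcomes = {};
  for (const task of tasks) {
    // Each task resolves to "done", "skipped", or "failed" -- never silence.
    outcomes[task] = await runTask(pi, ctx, task);
  }
  return outcomes;
}
```

The point of the shape is that the loop never advances on an ambiguous reply: every task leaves behind a named outcome you can log or surface in the UI.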
The Real Failure Mode
The original problem was not just that docs were sometimes skipped.
It was that the skip could be silent.
A silent skip is worse than an explicit skip. At least STEP_SKIPPED is observable. You can see it, log it, report it in UI, and decide whether that outcome is acceptable.
The extension now makes that distinction explicit:
- done is valid
- skipped is valid
- malformed or missing gate result is not valid
That gives the system a much cleaner control model.
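A gate parser with exactly that contract fits in a few lines. This is a sketch of how a parseTaskResult like the one in the snippet above might be implemented (my reconstruction; the extension's actual parsing may differ):

```javascript
// Sketch of a gate parser (reconstruction, not the extension's exact code).
// Accepts the reply only if it contains exactly one of the two markers.
function parseTaskResult(reply) {
  const text = (reply ?? "").toUpperCase();
  const done = text.includes("STEP_DONE");
  const skipped = text.includes("STEP_SKIPPED");
  if (done && !skipped) return "done";
  if (skipped && !done) return "skipped";
  return null; // malformed, missing, or ambiguous -> retry the task
}
```

Note that a reply containing both markers is treated as malformed, which matches the contract: one explicit outcome per turn, nothing ambiguous.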
The Shape Of The Runtime
The core loop is small, but it is more disciplined than the original version.
For each task, the extension:
- waits for the session to be idle
- snapshots the existing branch entry IDs
- sends one bounded instruction for the current task
- waits for the turn to start
- waits for the session to return to idle
- reads the new assistant reply only
- parses the gate result
- retries if the gate result is malformed or missing
That structure matters more than the prompt wording by itself.
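The snapshot-then-wait step is what keeps each turn isolated. A plausible shape for waitForTurnToStart, assuming getBranch returns the current session entries (a sketch of the pattern, not the extension's exact code):

```javascript
// Sketch (assumed shape, not the extension's exact code): poll the session
// branch until it contains an entry that was not in the pre-send snapshot.
async function waitForTurnToStart(ctx, seenEntryIds, { intervalMs = 100, timeoutMs = 30000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const branch = ctx.sessionManager.getBranch();
    // Any entry ID outside the snapshot means the new turn has begun.
    if (branch.some((entry) => !seenEntryIds.has(entry.id))) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("turn did not start before timeout");
}
```

Because the snapshot is taken before sending the instruction, the later "read the new assistant reply only" step can filter on the same ID set instead of guessing which message belongs to which task.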
To understand why the extension is reliable, look at what it is actually enforcing. It is not enforcing edits. It is enforcing explicit state.
Why This Wins In Practice
LLMs are generators, not transactional systems.
If you pack multiple cleanup phases into one prompt, the model has to infer sequencing, completion semantics, and stopping conditions all at once. That is where workflow drift starts.
Sequential gated prompting reduces that drift by making each step narrow and observable.
The trade-off is honest:
- more turns
- slightly more tokens
- clearer state transitions
- no silent terminal-step disappearance
And the extension adds one more useful distinction: skipped is not failure.
That matters because sometimes there really is nothing to change in one phase. If docs are already current, STEP_SKIPPED is the correct answer. The system should accept that and move on.
What it should not accept is ambiguity.
Why I Like This Pattern
A lot of agent failures come from pretending natural language alone is a workflow engine. It is not.
If a process has order, bounded phases, and different valid outcomes, you need explicit transitions between those phases.
In this case the valid outcomes are simple:
- done
- skipped
- failed to provide a valid result after retries
That is enough structure to make the automation trustworthy without overbuilding it.
The Takeaway
If your workflow depends on order, do not hide multiple state transitions inside one prompt.
Break the work into bounded steps. Require an explicit outcome for each one. Retry ambiguity, not honest skips.
That is the small shift that made this extension actually reliable.