AI Builder Notes - May 2026

AI-assisted notes from my liked-tweets feed, organized around agent workflows, browser traces, model loops, and guardrails.

Practical takeaways

Start with the workflow, not the agent. A useful agent task has a source of truth, a narrow action, a verifier, and a stop condition. “Review this repo” is vague. “Find auth bugs in these routes, cite file lines, run the relevant tests, and stop after the first credible exploit path” is a workflow.
Use dynamic workflows in Claude Code for the vibe bits of thinking through a workflow. You describe, in natural language, an entire workflow with multiple agents at various steps: I want the docs updated, tests passing, a security review done, and Playwright tests run. Dynamic workflows figures out which parts can run in parallel and which must run sequentially, draws a flowchart, and writes the JS for it. The result is a JS script that can run subagents at scale and deterministically. [1]
Planner/executor split is the way to go. Spend the expensive model on taste, decomposition, and risk discovery. Use cheaper or narrower models for repeatable implementation once the task has tests, rubrics, logs, or examples. [2]
Do not judge an agent workflow by the model name alone. If the loop has repo access, a rubric, a way to inspect tool calls, and a verifier, a less fashionable model can still do useful work. The Letta Code / GLM 5.1 review-bot example is interesting for that reason, not because “someone used X instead of Y” is interesting by itself. [3]
Prefer small interfaces to giant tool menus. MCP tool call definitions are rotting your context! The monday.com GraphQL example was the clearest cost warning: one task used 15k tokens through SDK/code-mode and 158k tokens through a real MCP server. MCP is useful, but a menu of tools is not automatically an efficient interface. [4] [5]
For browser work, save the trace. Run the workflow once, inspect wasted actions, replace repeated clicking with direct reads or JavaScript where safe, then save the better path as a skill. That is how browser agents become cheaper instead of just more automated. [6]
Security has to be designed into the harness. Stop rules, restart paths, permission gates, package-age delays, secret proxies, branch gates, logs, and human approval are the system. “Tell the model to be careful” is not a system.
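
A minimal Python sketch of the four parts named in the first takeaway: source of truth, narrow action, verifier, stop condition. Every name here (AgentTask, run_task, the stub action and verifier) is hypothetical and not taken from any cited tool. The stubs stand in for a model call and a test run; the point is only that the loop, not the model, owns verification and stopping.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentTask:
    """Hypothetical task spec: every field is an illustration, not a real API."""
    source_of_truth: str                 # e.g. a repo path or a set of routes
    action: Callable[[str, int], str]    # narrow action: (source, attempt) -> candidate
    verifier: Callable[[str], bool]      # independent check on each candidate
    max_attempts: int = 5                # hard stop condition


def run_task(task: AgentTask) -> Optional[str]:
    """Return the first verified candidate, or None once the budget is spent."""
    for attempt in range(task.max_attempts):
        candidate = task.action(task.source_of_truth, attempt)
        if task.verifier(candidate):
            return candidate  # stop after the first credible result
    return None  # budget exhausted without a verified result


# Stub action and verifier standing in for a model call and a test run.
def stub_action(source: str, attempt: int) -> str:
    return f"{source}:finding-{attempt}"


def stub_verifier(candidate: str) -> bool:
    return candidate.endswith("finding-2")


task = AgentTask("routes/auth", stub_action, stub_verifier)
result = run_task(task)
```

Here result is "routes/auth:finding-2": the loop stops on the third attempt because that is the first candidate the verifier accepts. With a verifier that never passes, run_task returns None instead of running forever.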

Agent workflows

The useful version of “dynamic workflows” is mechanical. Give Claude Code a high-level task and say “workflow”. It writes an orchestration script. That script creates smaller work units, starts coordinated subagents, gives each one a bounded target, and then pulls their outputs back into one final answer or patch. [1]

That is useful when the task has real shape: inspect five services, compare three implementations, test each candidate fix, collect account-specific data from a logged-in browser, or review a large diff from multiple angles. It is a bad fit for questions where one careful answer is enough.
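
The shape of such an orchestration script can be sketched in a few lines. This is an illustration in Python under loud assumptions: the thread describes a generated JS script, and its real structure is not shown in these notes. run_subagent is a hypothetical stub standing in for a real subagent call, and the agent and service names are invented.

```python
import asyncio


async def run_subagent(name: str, target: str) -> dict:
    """Hypothetical stub: a real script would start a subagent with a bounded target."""
    await asyncio.sleep(0)  # placeholder for real work
    return {"agent": name, "target": target, "status": "ok"}


async def run_workflow(services: list[str]) -> dict:
    # Independent units fan out in parallel: one reviewer per service.
    reviews = await asyncio.gather(
        *(run_subagent("reviewer", service) for service in services)
    )
    # Dependent steps run sequentially: tests start only after every review is done.
    tests = await run_subagent("tester", "all-services")
    # Pull the outputs back into one final result.
    all_ok = all(r["status"] == "ok" for r in reviews) and tests["status"] == "ok"
    return {"reviews": list(reviews), "tests": tests, "all_ok": all_ok}


summary = asyncio.run(run_workflow(["billing", "auth", "search"]))
```

The split between the gather call and the plain await is the whole idea: work with no dependencies fans out, work with dependencies waits.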

The same pattern showed up in smaller forms. One thread framed GPT-5.5 xhigh as the planner and Composer 2.5 subagents as implementers: the stronger model investigates, writes the plan, and delegates branches, worktrees, and PRs. [2] Cursor review skills running for 30 minutes are the same idea with time budget added: deeper search, more files read, more call paths followed, fewer drive-by comments than a quick /simplify. [7]

The “100 tool calls before answering” Codex prompt names the behavior missing from a lot of agent runs: do not stop after the first plausible answer. Read more. Falsify more. Show the trail. [8]

The model and the harness are tightly coupled: Claude Code and Codex fail differently, so the harness needs stop conditions, escape routes, and restart logic. [9] The model can plan the work, but the harness has to notice loops, stale branches, broken assumptions, tool spam, and cases where the agent should ask for help.
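
A rough sketch of two of those checks, loops and tool spam, assuming the harness can see each tool call as a (tool name, serialized arguments) pair. The function name and the thresholds are made up for illustration; real limits depend on the task.

```python
from collections import Counter
from typing import Optional


def check_run(tool_calls: list[tuple[str, str]],
              max_calls: int = 200,
              max_repeats: int = 3) -> Optional[str]:
    """Return a stop reason, or None if the run may continue.

    tool_calls is a list of (tool_name, serialized_arguments) pairs.
    The thresholds are illustrative defaults, not recommendations from the source.
    """
    if len(tool_calls) > max_calls:
        return "tool_spam"  # too many calls overall
    counts = Counter(tool_calls)
    if counts and max(counts.values()) > max_repeats:
        return "loop"  # the same call with the same arguments keeps recurring
    return None
```

A harness would call check_run after every tool call and, on a non-None reason, stop, restart from a checkpoint, or ask a human.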

Model vs loop

The review bot built on Letta Code and GLM 5.1 raises a useful question: what did the loop provide that made a cheaper model viable? Repo context, a review objective, expected output shape, examples of good comments, and a way to reject junk comments can matter more than the logo on the model. [3]

The Ramp spreadsheet retrieval case is the same lesson from a different direction. A specialist RL-trained model reportedly beat Opus on a narrow spreadsheet retrieval task. [10] That does not mean every team needs custom RL. It means narrow, verifiable work can reward narrow training, narrow evals, and narrow interfaces.

If you know what you want the model to do and you want to scale it, aim narrow with the loop and harness. You can get away with a much cheaper bill.

Command Code repairing tens of thousands of tool calls is another version of this. Tool use fails in repeatable ways: malformed JSON, wrong argument shape, missing state, wrong sequence, bad retry. If those errors can be repaired or caught automatically, the model gets a better workbench. [11]
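
A minimal sketch of what catching two of those failure classes could look like: malformed JSON and wrong argument shape. This is not Command Code's method, which the thread does not describe; the function names and the two repairs are assumptions chosen for illustration.

```python
import json
import re
from typing import Any, Optional


def parse_tool_args(raw: str) -> Optional[dict[str, Any]]:
    """Parse tool arguments, trying two cheap repairs only if strict parsing fails."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Repair 1: keep only the outermost {...}, dropping chatter around it.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            return None
        candidate = raw[start:end + 1]
        # Repair 2: drop trailing commas before a closing brace or bracket.
        # Crude: it would also touch such a comma inside a string value.
        candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            return None  # not repairable here; the harness should ask for a retry
    return parsed if isinstance(parsed, dict) else None


def validate_args(args: dict[str, Any], required: dict[str, type]) -> list[str]:
    """Return a list of problems: missing keys or wrong argument types."""
    problems = []
    for key, expected in required.items():
        if key not in args:
            problems.append(f"missing: {key}")
        elif not isinstance(args[key], expected):
            problems.append(f"wrong type: {key}")
    return problems
```

For example, parse_tool_args('{"path": "a.py", "line": 3,}') recovers the dict despite the trailing comma, and validate_args then reports any missing or mistyped argument before the tool ever runs.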

The Cloudflare Code Mode / MCP comparison is another reminder to keep MCP lean and avoid context rot. Better still, use MCP only when you are accessing a remote service, and prefer a CLI by default. Why: a GraphQL API task took 1 step and 15k tokens through SDK/code-mode, versus 4 steps and 158k tokens through a real MCP server. [4] [5] An agent interface is part of the product. Give the model a small, typed, task-shaped API when you can. Do not assume a broad tool menu is better because it feels more general.

Browser skills

The most concrete browser-agent example here is Hermes Agent / Autobrowse. A Hacker News workflow went from 102 seconds to 35 seconds, 23 turns to 8 turns, and $1.46 to $0.28 after the trace was simplified and saved as a skill. [12] [6]
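
The relative savings are easy to check. The snippet below only does arithmetic on the three before/after pairs quoted above; the helper name is made up.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from before to after, rounded to one decimal."""
    return round((before - after) / before * 100, 1)


# (before, after) pairs as reported for the Hacker News workflow.
runs = {"seconds": (102, 35), "turns": (23, 8), "dollars": (1.46, 0.28)}
savings = {name: reduction_pct(before, after) for name, (before, after) in runs.items()}
```

That works out to about 65.7% less time, 65.2% fewer turns, and 80.8% less cost. Cost fell faster than turn count.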

The trick was not magic browser control. The trick was noticing the repeated slow path. If the agent clicks through the same UI every time, inspect the page, read state directly where possible, remove wasted navigation, and save the shorter path. That is a real skill: the agent gets faster because the workflow gets smaller.

The adjacent tools worth tracking: the OpenAI Chrome plugin, BrowserCode, Autobrowse, browser-harness, Pi browser extensions, and Hermes browser skills. [13] [14] [6] [15] [16] [12] The category is logged-in browser work: support queues, internal tools, research, scraping, QA, admin ops, and anything where the useful data sits behind a session.

Memory and retrieval

Birdclaw is interesting because it gives agents access to a Twitter archive. [17] GBrain points at a personal recall layer around OpenClaw / Hermes-style workflows. [18] PageIndex is a useful reminder that simple retrieval, even BM25-only retrieval, still has a place. [19] The “RAG comeback in about 8 months” take lands because the archive problem is still unsolved in practice. [20]

A giant archive is not memory. Memory is knowing when to search, what to retrieve, how much to inject, and how to preserve provenance. A liked-tweets feed becomes useful only if the distillation keeps links, dates, claims, and enough source texture to make the note auditable later.
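
A small sketch of BM25-only retrieval that keeps provenance attached to every hit. It is a plain textbook BM25 with whitespace tokenization, not PageIndex's or any cited tool's implementation. The Note fields, the sample notes, and the example.com URLs are invented for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Note:
    text: str
    url: str    # provenance: where the claim came from
    date: str   # provenance: when it was posted


def bm25_search(notes: list[Note], query: str, top_k: int = 2,
                k1: float = 1.5, b: float = 0.75) -> list[tuple[float, Note]]:
    """Rank notes against the query with plain BM25; return (score, note) pairs."""
    if not notes:
        return []
    docs = [note.text.lower().split() for note in notes]
    avg_len = sum(len(doc) for doc in docs) / len(docs)
    # Document frequency: in how many notes does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scored = []
    for note, doc in zip(notes, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((len(docs) - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, note))  # the note carries its url and date along
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical notes; the URLs and dates are placeholders.
notes = [
    Note("browser agent trace saved as a skill", "https://example.com/a", "2026-05-01"),
    Note("package age delay for npm installs", "https://example.com/b", "2026-05-02"),
    Note("planner and executor model split", "https://example.com/c", "2026-05-03"),
]
hits = bm25_search(notes, "browser skill")
```

top_k is the "how much to inject" knob, and returning whole Note objects rather than bare text is what keeps the result auditable later.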

Security and guardrails

Cloudflare tested Anthropic Mythos against fifty repositories. [21] Another thread said Claude Mythos Preview helped Firefox fix more security bugs in April than in the previous 15 months combined. [22] Read neither as “AI fixes security now”. Read them as scoped security work becoming agent-shaped: known repo, known bug class, patch candidates, review loop, and humans still responsible for merging.

The most useful boring guardrail here is package-age delay. pnpm and npm both have settings that can avoid installing packages published too recently. [23] [24] This matters more with agents because agents will happily install dependencies at machine speed. A small delay catches some supply-chain attacks before they enter the workflow.

Two defaults worth setting. pnpm counts the age in minutes (2880 minutes is two days); npm counts it in days:

pnpm config set minimumReleaseAge 2880

npm config set min-release-age=2
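
The idea behind both settings fits in a few lines. This is a sketch of the rule only, not how pnpm or npm implement it. The function names, version numbers, and publish times are made up.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def old_enough(published_at: datetime, now: datetime,
               min_age: timedelta = timedelta(days=2)) -> bool:
    """True if the release has been public for at least min_age."""
    return now - published_at >= min_age


def newest_allowed(releases: dict[str, datetime], now: datetime,
                   min_age: timedelta = timedelta(days=2)) -> Optional[str]:
    """Pick the most recently published version that has passed the age gate."""
    allowed = {version: published for version, published in releases.items()
               if old_enough(published, now, min_age)}
    if not allowed:
        return None  # nothing is old enough: fail closed instead of installing
    return max(allowed, key=lambda version: allowed[version])


# Hypothetical release history for an imaginary package.
now = datetime(2026, 5, 10, tzinfo=timezone.utc)
releases = {
    "1.0.0": datetime(2026, 5, 1, tzinfo=timezone.utc),
    "1.0.1": datetime(2026, 5, 7, tzinfo=timezone.utc),
    "1.0.2": datetime(2026, 5, 9, 12, tzinfo=timezone.utc),  # only 12 hours old
}
chosen = newest_allowed(releases, now)
```

Here chosen is "1.0.1": the 12-hour-old 1.0.2 is skipped, which is exactly the window in which a freshly compromised release tends to get reported and pulled.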

Clawvisor belongs in the same bucket: approve agent access without handing raw credentials to the model. [25] These dull permission layers are more interesting than another demo where an agent clicks around a dashboard with full access.
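
A toy version of that permission layer, to show the shape. This is not Clawvisor's design or API, which the thread does not detail. SecretProxy, its fields, and the service and action names are hypothetical, and the outbound request is stubbed out.

```python
from dataclasses import dataclass, field


@dataclass
class SecretProxy:
    """Hypothetical broker: holds credentials and an allowlist for the agent."""
    _secrets: dict[str, str] = field(repr=False)  # never shown, even in repr()
    allowed: set[tuple[str, str]] = field(default_factory=set)  # approved (service, action)
    log: list[str] = field(default_factory=list)

    def call(self, service: str, action: str) -> dict:
        if (service, action) not in self.allowed:
            self.log.append(f"DENIED {service}.{action}")
            raise PermissionError(f"{service}.{action} needs human approval")
        self.log.append(f"ALLOWED {service}.{action}")
        # A real proxy would attach the secret to an outbound request here.
        # The stub only reports that a credential exists, never its value.
        return {"service": service, "action": action,
                "authenticated": service in self._secrets}


proxy = SecretProxy({"crm": "example-token"}, {("crm", "read")})
response = proxy.call("crm", "read")
```

The agent only ever holds the proxy. An unapproved call such as proxy.call("crm", "delete") raises PermissionError and leaves a DENIED line in the log for a human to review.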

Tools worth opening

Harness engineering learning site: useful if you want names for the parts around the model - evals, stop rules, retries, logs, and verification.
LiteParse v2: Rust PDF parsing for agent/RAG workflows where PDFs are the bottleneck. The useful question is not “is it fast?” but “does it preserve the parts your downstream model needs?”
Patter: voice AI in a few lines, with multiple providers. Useful if you want to prototype voice workflows without first committing to one stack. [27]
Minions: mission-control style UI for Hermes Agent tasks. Worth opening if you are running multiple local agents and need a control plane. [28]
OpenRouter Pareto Code: route to the cheapest code-capable model above a score threshold. This is the right kind of boring optimization for agent loops that run often. [29]
OpenRouter Response Caching: useful for tests, retries, and repeated agent prefixes. Caching is not glamorous, but repeated context is where agent bills quietly grow. [30]
Flue: TypeScript sandboxed-agent framework with runtimes and a secret proxy. Useful shape: run the agent in a controlled runtime instead of giving it everything. [31]
Zero: programming language for agents with explicit capabilities, JSON diagnostics, and typed safe fixes. Worth saving because explicit capabilities are a cleaner interface than vibes and instructions. [32]
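
The routing rule in the Pareto Code entry, cheapest model above a score threshold, is simple enough to sketch. This is not OpenRouter's API or its actual algorithm. The function, model names, scores, and prices are all invented.

```python
from typing import Optional


def pick_model(models: list[dict], min_score: float) -> Optional[str]:
    """Return the name of the cheapest model whose score meets the threshold."""
    eligible = [model for model in models if model["score"] >= min_score]
    if not eligible:
        return None  # nothing clears the bar; the caller decides what to do
    return min(eligible, key=lambda model: model["price"])["name"]


# Hypothetical catalog: scores and prices are placeholders, not real benchmarks.
catalog = [
    {"name": "model-a", "score": 0.90, "price": 15.0},
    {"name": "model-b", "score": 0.82, "price": 3.0},
    {"name": "model-c", "score": 0.60, "price": 0.5},
]
choice = pick_model(catalog, min_score=0.8)
```

With a threshold of 0.8 the choice is model-b: model-c is cheaper but fails the bar, and model-a clears it at five times the price.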