bengaluru, ist | srijanshukla18@gmail.com

AI Builder Newsletter - Month of May


Monthly notes from my liked-tweets feed: dynamic workflows, planner-executor stacks, browser skills, code-mode over MCP bloat, security agents, and the economics of AI-assisted work.

May was about operationalizing agents.

Also sent as an email newsletter on Substack.

The useful questions moved from “which model is smartest?” to: how do I split planning from execution, keep context clean, let agents use browsers safely, turn repeated work into skills, price the workflow, and stop an agent from burning 200k tokens on something a small CLI could do?

The thesis

  • Single-agent chat is giving way to work systems: planner, implementers, browser worker, reviewers, memory, queue, evals, stop rules.
  • The harness still dominates. A small change in tools, prompts, context, cache, browser access, or code-mode surface can beat a model upgrade.
  • Browser control and skillification are compounding loops: run workflow, optimize it, save it as a skill, future runs get cheaper.
  • Security is moving from advice to architecture: auth proxies, secret vaults, min-release-age package rules, sandboxing, exact audit trails.
  • The commercial wedge is not AI news (yes, this is an AI newsletter). It is paid work: frontier lab data loops, FDEs, internal agents, AI adoption, support agents, review bots, distribution systems.

1. Dynamic workflows became the right abstraction

Claude Code dynamic workflows: say “workflow”, and Claude writes an orchestration script and spins up coordinated subagents. The unit of work is no longer a prompt.

  • GPT-5.5 xhigh plans, Composer 2.5 subagents implement.
  • A strong model investigates and writes the plan, then delegates branches/worktrees/PRs.
  • Cursor review skills run for 30 minutes and return better code review than /simplify. [link]
  • Codex can be pushed into deeper investigation by requiring at least 100 tool calls before answering. [link]
  • Claude Code and Codex fail differently, so the harness needs stop conditions, escape routes, and restart logic. [link]
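The last two bullets (a minimum tool-call floor, plus stop conditions, escape routes, and restart logic) can be sketched as a small loop around any agent step. This is a minimal sketch, not any tool's API: `StopRules`, `run_with_stop_rules`, and the `step` contract (returns an answer or `None`, plus a progress flag) are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class StopRules:
    max_tool_calls: int = 200   # hard ceiling on tool calls per attempt
    min_tool_calls: int = 0     # e.g. 100 to force deeper investigation
    max_restarts: int = 2       # fresh attempts allowed after a stuck run

# Hypothetical contract: step(call_number) -> (answer_or_None, made_progress)
Step = Callable[[int], Tuple[Optional[str], bool]]

def run_with_stop_rules(
    step: Step, rules: StopRules, stall_limit: int = 3
) -> Tuple[str, Optional[str]]:
    for _attempt in range(rules.max_restarts + 1):
        stalled = 0
        for call in range(1, rules.max_tool_calls + 1):
            answer, progressed = step(call)
            stalled = 0 if progressed else stalled + 1
            # Accept an answer only once the minimum investigation depth is met.
            if answer is not None and call >= rules.min_tool_calls:
                return "done", answer
            if stalled >= stall_limit:
                break  # escape route: abandon this attempt and restart
    return "gave_up", None
```

The point of the shape: the limits live in the harness, so they hold even when the model insists it is nearly done.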

Planner/executor split is table stakes. Expensive model for taste, decomposition, risk discovery. Cheaper or specialized agents for implementation. Review gates outside the model.
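A minimal sketch of that split, assuming three plain callables: an expensive planner, a cheaper executor, and a review gate that is ordinary code. The function name and signatures are hypothetical; in practice each callable would wrap a model call or a test run.

```python
from typing import Callable, List

def plan_then_execute(
    task: str,
    planner: Callable[[str], List[str]],   # expensive model: decomposes the task
    executor: Callable[[str], str],        # cheaper model: implements one step
    review: Callable[[str, str], bool],    # deterministic gate, outside the model
    max_retries: int = 1,
) -> List[str]:
    results: List[str] = []
    for step in planner(task):
        for _ in range(max_retries + 1):
            output = executor(step)
            if review(step, output):
                results.append(output)
                break
        else:
            # Every retry failed the gate: stop instead of shipping bad work.
            raise RuntimeError(f"step failed review: {step}")
    return results
```

The review gate is deliberately not a model call here; a test suite, linter, or type checker fits the same slot.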

2. The model logo kept losing to the loop

A review bot with Letta Code ran on GLM 5.1, not GPT-5.5 or Opus. A 3B RL-trained model beat Opus on spreadsheet retrieval because the loop was narrow, verifiable, and repeatable. Command Code repaired tens of thousands of tool calls and made DeepSeek/Kimi match or beat closed models on some tool-using workflows.

Lesson: eval the loop, not the logo.

Code-mode vs MCP: one GraphQL API took 1 step and 15k tokens through SDK/code-mode, versus 4 steps and 158k tokens through a real MCP server. 8.4x the cost (and, on these rounded figures, about 10.5x the tokens). MCP is useful, but tool menus are not automatically efficient. [link]

Boring interfaces that matter: CLIs tuned for agents, stable JSON diagnostics, explicit capabilities, cached prefixes, small wrappers around messy APIs, browser actions that verify state, readable HTML artifacts. The harness is where cost and reliability live.
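One way to read "stable JSON diagnostics": a wrapper emits one JSON line per event, with a fixed schema and a stable `code` the agent can branch on. A minimal sketch; the field names and the `diagnostic` helper are my assumptions, not a standard.

```python
import json
from typing import Any, Dict

def diagnostic(code: str, message: str, **details: Any) -> str:
    """Render one diagnostic as a single JSON line with a fixed key order."""
    record: Dict[str, Any] = {
        "version": 1,        # bump only on breaking schema changes
        "code": code,        # stable identifier the agent can branch on
        "message": message,  # human-readable text, free to change
        "details": dict(sorted(details.items())),  # sorted for stable output
    }
    return json.dumps(record, separators=(",", ":"))

print(diagnostic("E_AUTH", "token expired", retry=False))
```

The agent matches on `code` and never parses `message`, so wording can improve without breaking a workflow.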

3. Browser skills started compounding

Hermes Agent used Autobrowse to improve a browser skill. Two iterations: HN workflow went from 102 seconds to 35, 23 turns to 8, $1.46 to $0.28. It learned to eval JavaScript directly and saved the workflow as a skill.
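The improvement is easier to compare as ratios. This uses only the three before/after pairs quoted above:

```python
before = {"seconds": 102, "turns": 23, "usd": 1.46}
after = {"seconds": 35, "turns": 8, "usd": 0.28}

def improvement(before: dict, after: dict) -> dict:
    """Ratio of before to after for each metric, rounded to one decimal."""
    return {key: round(before[key] / after[key], 1) for key in before}

ratios = improvement(before, after)
print(ratios)
```

That is roughly 2.9x faster, 2.9x fewer turns, and 5.2x cheaper. Cost fell further than time or turns.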

Future shape: run workflow, inspect trace, simplify actions, save skill, repeat.
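A toy version of that loop, assuming a trace is just a list of action strings. `SkillStore`, `simplify`, and the action format are hypothetical, and this is not how Autobrowse or Hermes store skills. Real simplification would be model-assisted rather than two string rules.

```python
from typing import Callable, Dict, List

Action = str  # hypothetical format, e.g. "click #login" or "eval document.title"

def simplify(trace: List[Action]) -> List[Action]:
    """Drop waits and consecutive duplicate actions from a recorded trace."""
    out: List[Action] = []
    for action in trace:
        if action.startswith("wait"):
            continue
        if out and out[-1] == action:
            continue
        out.append(action)
    return out

class SkillStore:
    def __init__(self) -> None:
        self.skills: Dict[str, List[Action]] = {}

    def run(self, name: str, explore: Callable[[], List[Action]]) -> List[Action]:
        """Replay a saved skill if one exists; otherwise explore and save it."""
        if name in self.skills:
            return self.skills[name]
        trace = explore()  # slow, model-driven run
        self.skills[name] = simplify(trace)
        return self.skills[name]
```

The second call for the same name never reaches `explore`, which is where the savings compound.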

OpenAI’s Chrome plugin, BrowserCode, Autobrowse, browser-harness, Pi browser extensions, and Hermes browser skills are all pushing the same loop: support, internal tools, research, scraping, QA, admin ops, logged-in web UI work.

4. Memory and retrieval stayed unresolved

Birdclaw gave Codex access to a Twitter archive. GBrain kept pointing at personal recall across OpenClaw/Hermes workflows. PageIndex/BM25-only RAG and the “RAG comeback in about 8 months” take were reminders that retrieval is not dead. What failed was bad retrieval.

The shape I believe:

  • markdown is the human interface
  • search latency is a harness primitive
  • graph edges matter for people, projects, claims, dates, contradictions
  • compaction is the memory write path
  • context injection is the product

If the agent does not know it needs to search, a giant archive is inert. The hard problem is when and why memory enters context.
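One crude answer to "when" is a score gate: search on every turn, but inject only results that clear a threshold. A minimal sketch using plain Okapi BM25 (the variant with +1 inside the log) over whitespace tokens. The threshold and `top_k` values are arbitrary assumptions, and `maybe_inject` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue  # term appears nowhere, contributes nothing
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            f = tf[term]
            norm = f + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores

def maybe_inject(query: str, docs: List[str], threshold: float = 0.5, top_k: int = 2) -> List[str]:
    """Return at most top_k documents, and only those scoring above threshold."""
    ranked = sorted(zip(bm25_scores(query, docs), docs), reverse=True)
    return [doc for score, doc in ranked[:top_k] if score >= threshold]
```

An empty result is the useful case: nothing relevant means nothing enters context.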

5. Security became workflow architecture

Cloudflare testing Anthropic Mythos against fifty repositories was the signal: offensive AI is strong enough that “patch faster” is not a full answer.

Claude Mythos Preview reportedly helped Firefox fix more security bugs in April than in the previous 15 months combined. Treat the number cautiously. The direction is obvious.

Immediately actionable:

pnpm config set minimumReleaseAge 2880   # pnpm: value in minutes, so 2880 = 2 days
npm config set min-release-age=2         # npm: value in days
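The idea behind both settings: refuse any version that has not been public long enough for a compromised release to be noticed and pulled. A conceptual sketch only, not how pnpm or npm implement it; `old_enough` is a hypothetical helper, and the minutes unit assumes the pnpm line above (2880 minutes is two days).

```python
from datetime import datetime, timedelta, timezone

def old_enough(published_at: datetime, min_age_minutes: int, now: datetime) -> bool:
    """True when a version has been public for at least min_age_minutes."""
    return now - published_at >= timedelta(minutes=min_age_minutes)

now = datetime(2025, 5, 10, tzinfo=timezone.utc)
three_days_old = now - timedelta(days=3)
one_day_old = now - timedelta(days=1)

print(old_enough(three_days_old, 2880, now))  # older than two days
print(old_enough(one_day_old, 2880, now))     # too fresh to install
```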

Boring guardrails belong in agent environments: secret vaults, auth proxies, sandboxes, branch gates, human approval for destructive actions, exact logs. “Tell the model to be careful” is not a security model.
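A minimal sketch of "human approval for destructive actions" plus "exact logs", assuming the harness sees each shell command before it runs. The pattern list is a toy and `gate` is a hypothetical name; a real setup would prefer an allowlist and a sandbox over matching strings.

```python
import json
import re
from typing import Callable, List

# Toy denylist for illustration only; easy to bypass, so not a security boundary.
DESTRUCTIVE = re.compile(
    r"\b(rm\s+-rf|git\s+push\s+--force|drop\s+table)\b", re.IGNORECASE
)

def gate(command: str, approve: Callable[[str], bool], log: List[str]) -> bool:
    """Return True if the command may run; append one audit record either way."""
    needs_approval = bool(DESTRUCTIVE.search(command))
    allowed = approve(command) if needs_approval else True
    log.append(json.dumps({
        "command": command,
        "needs_approval": needs_approval,
        "allowed": allowed,
    }))
    return allowed
```

`approve` is where a human (or a ticket system) answers. The log records every command, including the ones that were denied.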

6. Tools worth opening

  • LiteParse v2 - Rust PDF parser, up to 100x faster, more accurate than common model-free parsers.
  • Patter - voice AI in four lines, MIT, works with multiple providers.
  • Autobrowse - browser agents that save optimized workflows into skills.
  • Minions - mission control for Hermes Agent tasks.
  • Clawvisor - authorization layer for agents, approve task access without exposing raw credentials.
  • Birdclaw - Twitter archive for agents.
  • OpenRouter Pareto Code - route to the cheapest code-capable model above a score threshold.
  • OpenRouter Response Caching - caching for tests and agent retries.
  • Flue - TypeScript sandboxed-agent framework with runtimes and secret proxy.
  • Zero - programming language for agents with explicit capabilities, JSON diagnostics, typed safe fixes.
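The Pareto Code idea ("cheapest code-capable model above a score threshold") reduces to a filter and a min. A sketch of the selection rule only, not OpenRouter's implementation; the `Model` fields and the model names and numbers in the usage lines are made-up placeholders.

```python
from typing import List, NamedTuple, Optional

class Model(NamedTuple):
    name: str
    score: float         # benchmark score, higher is better
    usd_per_mtok: float  # blended price per million tokens

def cheapest_above(models: List[Model], min_score: float) -> Optional[Model]:
    """Pick the lowest-priced model whose score meets the threshold."""
    eligible = [m for m in models if m.score >= min_score]
    return min(eligible, key=lambda m: m.usd_per_mtok, default=None)

# Placeholder data, not real models or prices.
catalog = [
    Model("model-a", score=0.82, usd_per_mtok=12.0),
    Model("model-b", score=0.74, usd_per_mtok=1.5),
    Model("model-c", score=0.71, usd_per_mtok=0.4),
]
choice = cheapest_above(catalog, min_score=0.73)
```

Raising the threshold trades money for quality one step at a time, which makes the threshold the knob worth evaluating.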

Also saved: the harness engineering learning site and Vlad Feinberg’s frontier lab job guide.

7. Money and distribution

G2i was hiring 50 software engineers for frontier lab model-training work at 100-200 USD/hr, fully remote across 150+ countries. OpenAI Applied AI in Bengaluru was hiring ex-founder/CTO/AI PhD/MLE/DS profiles. Codex doing $100 tasks was a small but psychologically important signal: agents starting to cross from cost center to revenue operator.

  • review/annotation and lab data loops are real money
  • FDEs are structurally useful while AI changes fast
  • internal agents need services, not just software
  • AI adoption consulting is valuable when it installs workflows, not decks
  • support and customer-facing agent harnesses can become real SaaS
  • distribution still beats cleverness

The D2C vanity warning: 1M Instagram followers and flat sales for four months means the audience is not an asset. Distribution has to connect to buying behavior.

Uncomfortable take

AI builders are underpricing workflow taste.

Raw intelligence is getting cheaper. Scarce: knowing which workflow should exist, what can be deterministic, what needs a model, what must be verified, what should be a skill, when to stop, and where human judgment belongs.

Not prompt engineering. Systems engineering with a model in the loop.