AI Builder Notes - April 2026

AI-assisted notes from my liked-tweets feed, organized around harnesses, managed agents, memory, workflow packaging, and enterprise AI work.

Practical takeaways

Start with the harness, not the model. A useful agent system has context, tools, prompts, skills, memory, caching, compaction, permissions, evals, and a way to judge output. “Use the best model” is vague. “Give the model this sandbox, these tools, these traces, this memory write path, and this test surface” is a harness.
The OpenAI Agents SDK, Meta-Harness, browser-harness, Agent CI, and Passmark all made the same point from different angles: the model sits inside a larger machine. That machine decides cost, reliability, permission boundaries, memory, evals, and final output quality. [1] [2] [3] [4] [5]
Copy evals first. Traces become training material for the harness, generated code gets treated as a black-box artifact, and specs and tests become the objective. The useful systems were not “agent demos”; they were loops around traces, scores, assertions, and replay. [1] [2] [5]
Managed agents are becoming work surfaces. Claude Managed Agents had the cleanest product shape: brain/hands/session split, credentials outside the sandbox, durable event log, disposable Linux containers, OpenTelemetry, $0.08/session-hour, and a reported median time-to-first-byte drop of 60%. [6]
Persistent agents need persistent state. Codex, Claude Code routines, Gemini CLI subagents, pinned threads, scheduled work, automations, and heartbeats all pointed at the same requirement: the agent needs somewhere to work, remember, resume, and ask for permission. [7] [8] [9]
Memory is not storage. A folder of markdown, a graph, or a vector database is only starting material. Memory is deciding what enters context when, what gets compacted, and what provenance stays attached. [10] [11] [12] [13]
Enterprise AI money lives in the unglamorous 90%. Migrations, broken internal systems, compliance, support, data cleanup, job queues, search, docs, evals, permissions, and human review loops are where serious AI work turns into revenue. HireCade was the clean business example: $22M ARR, 95% margins, 5 people, no funding, 14 months. [14]

Harnesses

Two tools can call similar models and feel completely different. The difference is usually context, tools, prompts, skills, memory, caching, compaction, permissions, and provider behavior.

OpenAI Agents SDK made the parts explicit: agents, handoffs, tracing, long-running agents, sandboxes, and memory. [1] Meta-Harness treated harness improvement as an optimization loop over code, traces, and scores. [2]

browser-harness treated Chrome automation as a self-healing CDP harness instead of a brittle browser script. [3] Agent CI turned local GitHub Actions into a loop agents can execute against. [4] Passmark put natural-language regression tests, model assertions, caching, and telemetry around Playwright. [5]

The useful part was not that these tools had agents. It was that they wrapped the model in a measurable workbench. Evals are the part I would copy first.
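
A minimal sketch of what "copy evals first" can look like, assuming nothing about any of the tools above: recorded traces are replayed through the agent, each trace carries its own assertion, and the harness reports a score. `Trace`, `run_evals`, and `stub_agent` are hypothetical names for illustration; the stub stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """One recorded task: the prompt the agent saw and a check on its output."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # assertion applied to the agent's output


def run_evals(agent: Callable[[str], str], traces: list[Trace]) -> dict:
    """Replay every trace through the agent and score the results."""
    results = {}
    for trace in traces:
        try:
            output = agent(trace.prompt)
            results[trace.name] = bool(trace.check(output))
        except Exception:
            # A crash counts as a failed trace, not a failed eval run.
            results[trace.name] = False
    passed = sum(results.values())
    score = passed / len(traces) if traces else 0.0
    return {"score": score, "results": results}


def stub_agent(prompt: str) -> str:
    """Hypothetical stand-in for a model call; it just uppercases the prompt."""
    return prompt.upper()


TRACES = [
    Trace("uppercases", "hello", lambda out: out == "HELLO"),
    Trace("keeps-length", "abc", lambda out: len(out) == 3),
    Trace("adds-greeting", "bob", lambda out: out.startswith("Hi")),
]

report = run_evals(stub_agent, TRACES)
```

The point of the shape is that swapping the model, the prompt, or a tool changes only `agent`; the traces and checks stay fixed, so the score is comparable across harness changes.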

Managed agents

Claude Managed Agents had a clean split between brain, hands, and session. Credentials stayed outside the sandbox. The work happened in disposable Linux containers. The event log was durable. OpenTelemetry was built in. [6]
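
A rough sketch of why a durable event log makes disposable containers workable. This is not the Claude Managed Agents API; `EventLog`, `resume`, and the event kinds are hypothetical. It assumes an append-only JSONL file that lives outside the sandbox, so a fresh worker can rebuild session state by replaying it.

```python
import json
import tempfile
from pathlib import Path


class EventLog:
    """Append-only JSONL log. Assumed to live outside the disposable sandbox."""

    def __init__(self, path):
        self.path = Path(path)

    def read(self):
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def append(self, kind, payload):
        # Sequence number is the current event count (re-reads the file; fine for a sketch).
        event = {"seq": len(self.read()), "kind": kind, "payload": payload}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        return event


def resume(log):
    """Rebuild session state purely by replaying the log."""
    state = {"completed_steps": [], "pending": None}
    for event in log.read():
        step = event["payload"]["step"]
        if event["kind"] == "step_started":
            state["pending"] = step
        elif event["kind"] == "step_finished":
            state["completed_steps"].append(step)
            state["pending"] = None
    return state


# Usage: a first worker records progress, then disappears mid-step.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "session.jsonl"
    first = EventLog(path)
    first.append("step_started", {"step": "clone repo"})
    first.append("step_finished", {"step": "clone repo"})
    first.append("step_started", {"step": "run tests"})
    # A fresh worker picks up the same session from the log alone.
    state = resume(EventLog(path))
```

Nothing in `resume` depends on the first worker's memory or filesystem, which is the property that lets the container be thrown away.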

Codex was moving toward a universal dev app: browser use, computer use, multi-terminal, SSH/devboxes, docs and PDFs, memory, plugins, and automations. Chronicle-style memory gave recent screen context without making the user repeat what they were doing. [7]

Claude Code routines, Codex automations, Gemini CLI subagents, pinned threads, scheduled work, and heartbeats were the same pattern with different surfaces. [8] [9] A persistent agent needs persistent state and a real work surface.

Shopify AI Toolkit and Cloudflare’s Agent Lee made the enterprise version concrete. Agents were getting write access to products, orders, inventory, SEO, images, Workers, R2, DNS, and error summaries. Protocol and permission layers mattered more than the chat UI. [15] [16]

Memory and retrieval

Memory was annoying because it is easy to fake with storage.

GBrain, LLMwiki, Rowboat, Obsidian AI tools, and Hermes memory workflows all attacked the same problem: turn raw tweets, chats, notes, decisions, and project history into recall that enters context at the right time. [10] [11] [12] [17] [13]

The useful pieces were simple: markdown stays human-facing, compaction is the memory write path, graph edges matter for people and claims, file search fails when the agent does not know it should search, and proactive injection is the hard part.

A giant archive is not memory. Memory is deciding what enters context when.
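
A toy sketch of "deciding what enters context when", under heavy assumptions: relevance is plain tag overlap, the budget is counted in characters, and every selected line keeps its source attached. `MemoryItem` and `select_context` are hypothetical names, not taken from GBrain, Rowboat, or any tool above; a real system would swap in better scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    text: str
    source: str       # provenance: where this memory came from
    tags: frozenset   # assumed to be assigned at write time


def select_context(items, query_tags, budget_chars):
    """Pick the most relevant items that fit the budget, provenance attached."""
    scored = []
    for position, item in enumerate(items):
        overlap = len(item.tags & query_tags)
        if overlap > 0:  # irrelevant items never enter context
            scored.append((overlap, position, item))
    # Highest overlap first; ties keep original order.
    scored.sort(key=lambda entry: (-entry[0], entry[1]))

    chosen, used = [], 0
    for _, _, item in scored:
        line = f"[{item.source}] {item.text}"
        if used + len(line) > budget_chars:
            continue  # skip what does not fit; a smaller item may still fit
        chosen.append(line)
        used += len(line)
    return chosen


ARCHIVE = [
    MemoryItem("Chose Postgres over SQLite for the queue.", "decisions.md",
               frozenset({"db", "queue"})),
    MemoryItem("Lunch order notes.", "chat-2026-04-02", frozenset({"food"})),
    MemoryItem("Queue worker retries three times.", "notes/queue.md",
               frozenset({"queue"})),
]

roomy = select_context(ARCHIVE, {"db", "queue"}, budget_chars=200)
tight = select_context(ARCHIVE, {"db", "queue"}, budget_chars=60)
```

The archive is the same in both calls; what changes is what reaches the model. That selection step, not the storage format, is the part that has to be designed.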

Workflow packaging

The meta-tool was SKILL.md: a solved path compressed into instructions, scripts, and boundaries.

The same pattern showed up everywhere. Do the work once. Keep the trace. Remove wasted steps. Save the workflow. Let the agent start from the compressed path next time.

The workflows worth stealing were small:

Database performance loop: seed repro data, optimize query, try ten indexes, measure impact.
Spec-first agent work: review SPEC.md boundaries, let agents work behind them, review contracts instead of every line.
Skillification loop: do the work once, turn the solved path into a skill.
HTML implementation notes: keep implementation-notes.html while working. Decisions, tradeoffs, and gaps stay readable.
Deterministic first: a cron job plus one LLM API call replaces most “agents.” Recurring workflows as code, not token burn.
Browser loop: agent sees state, acts in Chrome, verifies result.
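
As one worked instance of the list above, here is a sketch of the database performance loop using Python's built-in `sqlite3`. The `orders` table, the query, and the three candidate indexes are invented for illustration (the note says ten; three keeps it short). Timings will vary by machine, so the query plan is recorded alongside them.

```python
import sqlite3
import time

QUERY = "SELECT COUNT(*) FROM orders WHERE customer_id = 42 AND status = 'open'"

# Hypothetical candidates; an agent would generate and rank many more.
CANDIDATES = {
    "idx_status": "CREATE INDEX idx_status ON orders (status)",
    "idx_customer": "CREATE INDEX idx_customer ON orders (customer_id)",
    "idx_customer_status":
        "CREATE INDEX idx_customer_status ON orders (customer_id, status)",
}


def seed(conn, rows=20000):
    """Seed deterministic repro data so every run measures the same thing."""
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, status) VALUES (?, ?)",
        ((i % 500, "open" if i % 7 == 0 else "closed") for i in range(rows)),
    )
    conn.commit()


def measure(conn, repeats=20):
    """Return (average seconds per query, query plan text)."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(QUERY).fetchone()
    elapsed = (time.perf_counter() - start) / repeats
    # The fourth column of EXPLAIN QUERY PLAN output is the human-readable detail.
    plan = " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY))
    return elapsed, plan


def try_indexes():
    conn = sqlite3.connect(":memory:")
    seed(conn)
    report = {"baseline": measure(conn)}
    for name, ddl in CANDIDATES.items():
        conn.execute(ddl)              # try one index at a time
        report[name] = measure(conn)
        conn.execute(f"DROP INDEX {name}")  # reset before the next candidate
    conn.close()
    return report


report = try_indexes()
for name, (seconds, plan) in report.items():
    print(f"{name:22s} {seconds * 1e6:9.1f} us  {plan}")
```

The loop is deliberately boring: fixed data, one change at a time, a measurement after each. That is what makes it safe to hand to an agent.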

Money

HireCade was the business link I saved: $22M ARR, 95% margins, 5 people, no funding, 14 months. Mostly annotation services, 30% service fee, large candidate database, long enterprise sales cycle. [14]

AI labs and AI-heavy companies need verification, review, annotation, migration, maintenance, and domain-specific human loops. “AI adoption consulting at $200+/hr” is real when it installs workflows.

Enterprise AI money is the ugly 90%: migrations, broken internal systems, compliance, support, data cleanup, job queues, search, docs, evals, permissions.

Ship, share, repeat still beats separate marketing. Code plus content plus product storytelling is a distribution system.

Most AI adoption stalls on missing work graphs: connected tools, permissions, memory, evals, workflow defaults, review gates, and someone willing to redesign the work.

Tools worth opening

Hermes Agent: open agent stack with skills, tools, memory, phone access, and cost visibility.
GBrain: personal total-recall layer around OpenClaw / Hermes-style workflows.
Cabinet: open-source startup OS with agents, schedules, KB, browser terminal, local-first.
browser-harness: self-healing browser harness for Claude Code / Codex-style work.
Pi: worth tracking for browser-adjacent agent workflows and logged-in work surfaces.
HyperFrames: agent-native HTML to MP4.
Firecrawl web-agent and Firecrawl Parse: web/PDF ingestion for agents.
Agent CI: local GitHub Actions for agents.
awesome-design-md: brand/design systems as markdown for agents.
Syncthing: boring peer-to-peer file sync, relevant because durable local files beat trapped SaaS databases.