loc bengaluru, ist | local --:-- srijanshukla18@gmail.com
[post]/ai/notes-on-agent-engineering

Notes on agent engineering

/ 8 min read· ai

Notes on agent engineering: what counts as an agent, why long tasks fail, harness design, eval loops, observability, and when teams build vs buy.

Notes on agent engineering

Notes on agent engineering: what counts as an agent, why long tasks fail, harness design, eval loops, observability, and when teams build vs buy.

These notes come from trying agent o11y systems: framework quickstarts, lab demos, prod traces, and eval/observability vendors. The reader I had in mind has already built something like the weather bot and now needs a map of what starts breaking next.

From weather bot to agent

The ecosystem has two hello worlds. The pure LLM version is a single chat completion: send "Hello", print the reply, confirm the model API works. The agent version is usually the weather example: define one get_weather() tool, ask “what is the weather in San Francisco?”, and watch the model emit a tool call.

That weather example is useful, but it only proves the harness is wired. The schema serializes, the model returns a parseable tool call, the executor dispatches it, and the result comes back. A real agent needs a loop the model controls across multiple steps: call a tool, read the result, then choose whether to continue or stop. The interesting line is that middle continue/stop decision.

Real systems add planning, tool use, memory, and judgment around that loop. A small task like “what should I pack?” can already require agency: read the calendar, call weather for the destination, then answer. You cannot specify step two until step one returns.

The exercise below has patterns we kept misclassifying until we made the criterion boring: does the model need to loop with a real continue/stop choice between steps?

exercise · classification

Tool call or agent loop?

select a scenario

Criterion: explicit continue/stop between steps.

Long horizons need a harness

Current models can chain dozens of tool calls and run coding agents for hours. In production, the ceiling is usually run coherence across a long horizon.

The frontier models are becoming more and more agentic, because they are RL’ed to do more tool calls. That can improve reliability - generally, more tool calls means better context for the model.

Each step can look acceptable in isolation while the sequence fails. End-to-end success is per-step reliability raised to the number of steps; at 95% per-step success over twenty steps, the full run is near a coin flip. That matched polished demos breaking under real traffic: individual steps passed inspection, then the run drifted.

exercise · compounding reliability

End-to-end success = (per-step reliability)n

- end-to-end

Once you see that curve, checkpointing, partial recovery, and resume-after-failure become infrastructure. A larger model may raise per-step reliability, but the exponent stays.

Framework choice should follow the spec. Start from the terminal state - a resolved support ticket, a merged pull request, a closed incident - and work backward into reachable tools, states, transitions, and the context in the model window at each step. The field calls that context engineering. The prompt is only one input; authoring the harness means deciding what the model sees, which tools it can call, and when the run should advance.

Put bluntly: you are building a state machine with a stochastic transition function in the middle.

The stochastic part

The harness is deterministic code. Tools run the same way each time, context assembly is a function, and transitions are edges you defined. Then the run reaches the model and returns text that was impossible to derive in advance.

“Stochastic” gets used for two different problems, and mixing them wastes time. The narrow sense is sampling variance: temperature, top-p, the weighted die over tokens. You can drive those toward zero and still miss bit-identical runs, because batching, floating-point ordering, mixture-of-experts routing, and silent revisions sit underneath the same model name.

The broader sense is underspecification. There is no closed-form spec for the next customer message the way there is for parseInt on a novel string. Fixing sampling leaves that layer untouched.

exercise · resampling

Fixed prompt: “Where’s my order?”

Resample to observe variance.

temperature and input fixed

The oracle itself is the wrong debugging target. You constrain what reaches it, log what comes back, and measure output distributions.

Evals and the feedback loop

Classical tests assert output === expected. Agent runs break that pattern in two ways: inputs are open-world, and outputs have many valid forms. Customers phrase the same intent differently and bring edge cases your demo never included; many acceptable replies exist to “where is my order?” A scorer checks properties such as grounding, tone, correct tool use, and invented policy.

Treat the prompt as versioned application code. Treat the eval dataset as your model of the live input distribution. A useful dataset favors different situations over twenty-five paraphrases of one situation. Weight toward the tail; that is where production systems fail. Evals regression-test that distribution and sit alongside prompt editing in a playground.

Classical software often ships along a line: edit code, run tests, deploy, monitor. LLM systems add a feedback edge: a failed production trace gets annotated and becomes a row in the eval dataset. Production failures feed the lab, and LLMOps platforms try to make that loop cheap.

Edit prompt = write code Offline eval = CI tests Deploy = ship Observe = monitoring Annotate failures trace → test case prod failures → eval set
Feedback from production into the eval dataset has no classical analogue.

Offline names the dataset being scored. In agent systems, offline eval scores curated examples; online eval samples live production traffic. The job can still call model APIs over the network.

Offline eval gives you a pre-deploy gate on situations you chose ahead of time - you can rerun it and compare. Online eval samples real traffic after deploy. There is usually no single expected string; you score a subset of what users sent. Online finds new failure modes. Offline pins them so they cannot return unnoticed.

Traces and vendors

When an agent fails in production, you need the full trajectory: which tools fired, in what order, with what prompts and token counts. HTTP access logs say a request completed in 1.3 seconds with a 200. That is the envelope. OpenTelemetry with GenAI semantic conventions is closer to the letter inside.

Three OpenTelemetry layers matter here. The data model defines a span as a unit of work with timing, status, parent-child links, and attributes. Tool calls, prompts, and tokens live there. Instrumentation is SDK code that creates spans. OTLP is the wire format that ships spans; it is transport for the model data.

Vendor hello worlds are positioning documents. Each first tutorial advertises the layer the vendor owns.

LangGraph’s hello world is a calculator agent: a graph with a loop-back edge. It demonstrates orchestration and may emit span events, while dashboarding belongs to the observability layer.

An eval product’s hello world is a scorer over a small dataset, sometimes using deterministic checks. It demonstrates measurement against examples.

An observability product’s hello world is a trivial run whose span tree renders in a UI. It demonstrates ingestion and display.

Those layers stack. Orchestration, observability, and evals solve different problems.

LangSmith, Langfuse, and Braintrust overlap heavily: tracing, evals, prompt storage, datasets, human annotation. Organizations usually standardize on one; running all three is rare. Choice follows constraints more than feature checklists.

LangSmith is the default when you are already on LangChain or LangGraph, because it instruments that runtime and exposes graph structure in traces. Langfuse fits teams that need open-source self-hosting for data residency or cost control; ClickHouse acquired Langfuse in early 2026. Braintrust fits teams that want deploy gates driven by eval scores, similar to CI test suites.

There is another split: AI-native platforms (the three above, plus Arize, Galileo, Weave) versus general APM vendors (Datadog, New Relic, SigNoz) adding LLM span types. One camp models agent trajectories as first-class objects. The other embeds them in conventional tracing.

Build or buy

The eval and observability stack in these notes serves teams that build agents in-house. That is a minority path in customer support and similar markets. The first fork is build versus buy; ticket volume and control requirements usually pick the default.

Below roughly two thousand resolutions per month, consumption products such as Intercom Fin or Ada are common; pricing is often near a dollar per resolution and setup is measured in hours. In the SMB band, Lindy can pencil out when total support spend is under roughly half a million dollars per year. Enterprise deployments that need agents to take real actions in core systems often land on Sierra or Decagon, with six-figure contracts and per-resolution pricing in the two-to-five dollar range. Teams already on Zendesk or Salesforce often adopt Agentforce or Zendesk AI because the integration path is shortest.

The build path appears when volume is high or control requirements are hard. Around thirty thousand resolutions per month, building can land near twelve to fifteen percent of buying, but you inherit operational ownership. LangSmith-class tooling shows up there as one component among many, separate from the support product purchase itself.

Closing definition

Agent engineering, one sentence: engineer context and control flow as a state machine around a stochastic core, then measure against an input distribution you cannot fully enumerate.

Production needs both: a harness that gives evals a stable target, and an eval loop that keeps the harness honest. Most teams learn that by breaking both first.