Field notes: how production AI agents work

These notes come from working through production agent systems: framework quickstarts, demos that looked fine in the lab, traces that explained failures in prod, and the eval/observability vendors that sit on the build path. They are written for someone who has already run a weather bot and now needs a map of what comes next. Read the sections in order; later sections assume earlier ones.

01 · Two hello worlds

Every framework exposes two different entry points, and conflating them causes confusion for months.

The pure LLM hello world is a single chat completion. You send "Hello", you print the reply. That tests the model API and nothing else.

The agent hello world is the weather example: one get_weather() tool and a prompt like “what is the weather in San Francisco?” Nearly every orchestration quickstart ships some variant of this pattern.

A single tool call is still a one-step workflow. The model does not decide whether to run a second step or stop. That limitation is deliberate. The weather agent exists to verify the harness, the same way print("hello world") verifies that your compiler, linker, and runtime are wired. In a passing run, the tool schema serializes, the model emits a parseable tool call, your executor dispatches it, and the result returns to the caller. You are not proving intelligence. You are proving plumbing.
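The four plumbing checks above can be sketched without a network call. This is a minimal illustration, not any framework's quickstart: `fake_model`, `run_once`, the canned weather data, and the JSON shape of the tool call are all assumptions made for the example.

```python
import json

# Hypothetical tool: returns canned data instead of calling a weather API.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "foggy", "temp_c": 14}

# Tool registry with a JSON-Schema-style description; real providers differ in shape.
TOOLS = {
    "get_weather": {
        "fn": get_weather,
        "schema": {
            "name": "get_weather",
            "description": "Current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
}

def fake_model(prompt: str, tool_schemas: list) -> str:
    """Stand-in for a chat completion that emits exactly one tool call as JSON."""
    return json.dumps({"tool": "get_weather", "arguments": {"city": "San Francisco"}})

def run_once(prompt: str) -> dict:
    schemas = [entry["schema"] for entry in TOOLS.values()]
    json.dumps(schemas)                              # 1. the tool schema serializes
    call = json.loads(fake_model(prompt, schemas))   # 2. the tool call parses
    tool = TOOLS[call["tool"]]["fn"]                 # 3. the executor dispatches it
    return tool(**call["arguments"])                 # 4. the result returns to the caller

result = run_once("what is the weather in San Francisco?")
```

Nothing here loops or decides anything. If any of the four numbered lines raises, the harness is broken before the model's quality is even in question.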

02 · What makes something an agent

An agent, in the sense that matters for production engineering, is a loop the model controls across multiple steps. It calls a tool, reads the result, and makes an explicit decision about whether to continue or stop. That continue/stop decision in the middle of the run is the line people skip when they label every chatbot-with-tools an “agent.”

In real systems you also need planning, tool use, memory, and judgment working together. The minimal example that actually requires agency is a dependent multi-hop task: you cannot specify step two until step one returns. “What should I pack?” might require reading a calendar, then calling weather, then answering. The model must iterate.
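A minimal sketch of that loop, using the packing example. The model is replaced by a scripted stand-in (`scripted_model`) so the control flow is visible; the tool stubs, the decision dictionary, and the step budget are assumptions for illustration.

```python
from typing import Callable

# Hypothetical tool stubs with canned results.
def read_calendar(_: str) -> str:
    return "Trip to Seattle on Friday"

def get_weather(city: str) -> str:
    return f"{city}: rain, 9 C"

TOOLS: dict[str, Callable[[str], str]] = {
    "read_calendar": read_calendar,
    "get_weather": get_weather,
}

def scripted_model(history: list[tuple[str, str]]) -> dict:
    """Stand-in for the model: chooses the next action from what it has seen so far."""
    seen = {name for name, _ in history}
    if "read_calendar" not in seen:
        return {"action": "call", "tool": "read_calendar", "arg": ""}
    if "get_weather" not in seen:
        # Step two depends on step one: the city comes out of the calendar result.
        city = "Seattle" if "Seattle" in history[-1][1] else "unknown"
        return {"action": "call", "tool": "get_weather", "arg": city}
    return {"action": "stop", "answer": "Pack a rain jacket."}

def run_agent(max_steps: int = 5) -> tuple[str, list[tuple[str, str]]]:
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        decision = scripted_model(history)
        if decision["action"] == "stop":      # the explicit continue/stop choice
            return decision["answer"], history
        result = TOOLS[decision["tool"]](decision["arg"])
        history.append((decision["tool"], result))
    return "step budget exhausted", history

answer, history = run_agent()
```

The harness owns the step budget and the dispatch. The model owns the decision to continue or stop, which is the part a single tool call never exercises.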

The exercise below is a set of patterns we kept misclassifying until we stated the criterion plainly: does the model need to loop with a real continue/stop choice between steps?

exercise · classification: tool call or agent loop? Criterion: explicit continue/stop between steps.

03 · Why long tasks fail in production

Current models can chain dozens of tool calls and run coding agents for hours. The ceiling in production is usually not “can it reason?” It is whether the run survives a long horizon without compounding error.

Each step can look acceptable in isolation. If steps fail independently, end-to-end success is per-step reliability raised to the power of the number of steps. At 95% per-step success over twenty steps, the full task succeeds about 36% of the time, worse than a coin flip. That math matches what we saw when polished demos broke under real traffic: no single step looked broken, but the sequence failed.
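The arithmetic is easy to check directly. A small sketch, assuming every step succeeds independently with the same probability (real runs are messier); the function names are ours.

```python
import math

def end_to_end(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps

def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to hit `target` end-to-end success."""
    return target ** (1 / steps)

def max_steps_above(per_step: float, threshold: float) -> int:
    """Largest step count whose end-to-end success is still >= threshold."""
    return math.floor(math.log(threshold) / math.log(per_step))

twenty_steps = end_to_end(0.95, 20)          # about 0.358
coin_flip_at = max_steps_above(0.95, 0.5)    # 13 steps
needed = required_per_step(0.90, 20)         # about 0.9947
```

Read the other way around: at 95% per step the run drops below even odds after thirteen steps, and holding 90% end to end over twenty steps needs roughly 99.5% per step.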

exercise · compounding reliability

End-to-end success = (per-step reliability)ⁿ

Per-step reliability 95.0%, steps 20: end-to-end ≈ 35.8%

When you see that curve, checkpointing, partial recovery, and resume-after-failure stop being nice-to-haves. They are infrastructure. Swapping to a larger model may raise the base, but it does not change the exponent.
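A minimal sketch of resume-after-failure, assuming each step is a callable whose result is JSON-serializable. The checkpoint layout, `run_with_checkpoints`, and the flaky step are hypothetical, and the file write is not atomic, which a production version would need to address.

```python
import json
import os
import tempfile

def run_with_checkpoints(steps, path):
    """Run steps in order, persisting progress so a rerun skips completed steps."""
    state = {"done": 0, "results": []}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)              # resume from the last saved step
    for i in range(state["done"], len(steps)):
        state["results"].append(steps[i]())   # may raise; nothing is saved for this step
        state["done"] = i + 1
        with open(path, "w") as f:
            json.dump(state, f)
    return state["results"]

calls = {"a": 0, "b": 0}

def step_a():
    calls["a"] += 1
    return "a"

def flaky_b():
    calls["b"] += 1
    if calls["b"] == 1:
        raise RuntimeError("transient tool failure")
    return "b"

path = os.path.join(tempfile.mkdtemp(), "run.json")
try:
    run_with_checkpoints([step_a, flaky_b], path)   # fails on the second step
except RuntimeError:
    pass
results = run_with_checkpoints([step_a, flaky_b], path)  # resumes at the second step
```

The second call does not repeat `step_a`. Without the checkpoint, every retry would pay the full compounding cost again from step one.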

04 · The harness is a state machine

Framework choice should follow specification, not the other way around. Start from what a finished run looks like: a resolved support ticket, a merged pull request, a closed incident. Work backward from that terminal state.

You then define which tools are reachable, which states and transitions form the control flow, and what context sits in the model window at each step. The field has started calling that last part context engineering. The prompt is one input to it, not the whole job. Authoring the harness means deciding, step by step, what the model sees and which tools it can call.

Stated plainly: you are building a state machine with a stochastic transition function in the middle.
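One way to picture that sentence in code. The states, the transition table, the per-state tool lists, and `fake_model` (a random choice standing in for the model call) are all illustrative assumptions, loosely modeled on a support ticket.

```python
import random
from enum import Enum

class State(Enum):
    TRIAGE = "triage"
    LOOKUP = "lookup"
    RESPOND = "respond"
    DONE = "done"

# Deterministic parts of the harness: legal edges and reachable tools per state.
TRANSITIONS = {
    State.TRIAGE: [State.LOOKUP, State.RESPOND],
    State.LOOKUP: [State.RESPOND],
    State.RESPOND: [State.DONE],
}
TOOLS = {State.TRIAGE: [], State.LOOKUP: ["get_order"], State.RESPOND: []}

def build_context(state, path):
    """Context engineering in miniature: decide what the model sees at this step."""
    return {
        "state": state.value,
        "tools": TOOLS[state],
        "allowed_next": [s.value for s in TRANSITIONS[state]],
        "history": [s.value for s in path],
    }

def fake_model(context, rng):
    """Stand-in for the model call: the stochastic transition function."""
    return rng.choice(context["allowed_next"])

def run(rng, max_steps=10):
    state = State.TRIAGE
    path = [state]
    while state is not State.DONE and len(path) <= max_steps:
        proposed = State(fake_model(build_context(state, path), rng))
        if proposed not in TRANSITIONS[state]:   # the harness validates every edge
            raise ValueError(f"illegal transition {state} -> {proposed}")
        state = proposed
        path.append(state)
    return path

path = run(random.Random(0))
```

Different seeds take different routes through the graph, but every route starts at triage, follows only declared edges, and ends in the terminal state. That guarantee comes from the harness, not the model.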

05 · The stochastic core

Everything you write in the harness is deterministic code. Tools run the same way each time. Context assembly is a function. Transitions are edges you defined. Then you call the model and receive text you could not have derived in advance.

“Stochastic” shows up in two different discussions, and mixing them wastes time.

The narrow sense is sampling variance: temperature, top-p, the weighted die over tokens. You can drive that variance toward zero by lowering temperature or decoding greedily. You still will not get bit-identical runs in practice because of batching, floating-point ordering, mixture-of-experts routing, and silent revisions behind the same model name.

The broader sense is unspecified behavior. There is no closed-form spec for the next customer message the way there is for parseInt on a novel string. Fixing sampling does not remove that layer.

exercise · resampling: with temperature and input fixed (prompt: “Where’s my order?”), resample to observe variance.

You do not debug the oracle directly. You constrain what reaches it, log what comes back, and measure distributions over outputs.
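Measuring a distribution can be as simple as resampling a fixed prompt and counting outcomes. In this sketch `fake_model` is a weighted random choice standing in for a real model call, and the reply labels and weights are invented for illustration; in practice a scorer would assign the label to each reply.

```python
import random
from collections import Counter

# Hypothetical (label, reply) pairs the stand-in model can produce.
REPLIES = [
    ("asks_for_order_id", "Can you share your order number?"),
    ("looks_up_order", "Let me check that order for you."),
    ("invents_policy", "Orders always arrive within 24 hours."),
]

def fake_model(prompt, rng):
    """Stand-in for a model call: same prompt in, different reply out."""
    return rng.choices(REPLIES, weights=[6, 3, 1], k=1)[0]

def sample_distribution(prompt, n, seed):
    """Resample the same prompt n times and return the share of each reply label."""
    rng = random.Random(seed)
    counts = Counter(fake_model(prompt, rng)[0] for _ in range(n))
    return {label: count / n for label, count in counts.items()}

dist = sample_distribution("Where's my order?", n=1000, seed=0)
```

The useful question is no longer "what did it say?" but "how often does it land in the bad bucket?", here the share of `invents_policy`.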

06 · Why evals exist

Classical tests assert output === expected. That breaks down for agents for two independent reasons.

Inputs are open-world. Customers phrase the same intent in different words and bring edge cases your demo never included. Outputs are non-deterministic. Many acceptable replies exist to “where is my order?” A scorer therefore checks properties: Was the answer grounded? Was the tone right? Was the correct tool used? Did the model invent policy?
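A property scorer might look like the sketch below. The `Run` record, the substring-based checks, and the banned-phrase list are simplifying assumptions; tone is left out because it usually needs a human or model grader rather than string matching.

```python
from dataclasses import dataclass

@dataclass
class Run:
    reply: str
    tools_called: list
    retrieved_facts: list   # facts the tools returned during the run

def score(run, expected_tool, banned_phrases):
    """Check properties of a run instead of comparing against one expected string."""
    reply = run.reply.lower()
    return {
        "correct_tool": expected_tool in run.tools_called,
        "grounded": any(fact.lower() in reply for fact in run.retrieved_facts),
        "no_invented_policy": not any(p.lower() in reply for p in banned_phrases),
    }

facts = ["shipped on May 2"]
banned = ["guarantee delivery"]

good_a = Run("Your order #123 shipped on May 2.", ["get_order"], facts)
good_b = Run("Good news: it shipped on May 2 and is on its way.", ["get_order"], facts)
bad = Run("We guarantee delivery in 24 hours.", [], facts)

scores = [score(r, "get_order", banned) for r in (good_a, good_b, bad)]
```

Both acceptable replies pass despite different wording, which is exactly what an equality assertion cannot express.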

Treat the prompt as versioned application code you change deliberately. Treat the eval dataset as your model of the live input distribution. A useful dataset contains different situations, not twenty-five paraphrases of one situation. Weight toward the tail, because that is where production systems fail after launch. Evals are how you regression-test that distribution. They are not a substitute for editing the prompt in a playground.

07 · The LLM lifecycle is a loop

Classical software often ships along a line: edit code, run tests, deploy, monitor. LLM systems add a feedback edge that has no clean analogue in pre-LLM release engineering.

You still edit prompts, run offline evals, deploy, and observe production. The additional step is annotation: a failed production trace becomes a row in the eval dataset. Production failures continuously feed the lab. LLMOps platforms are largely built to make that loop cheap.
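The annotation step can be sketched as a small transformation from a trace to a dataset row. The `Trace` and `EvalRow` shapes and the de-duplication rule are assumptions for illustration, not any platform's schema.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    trace_id: str
    user_input: str
    output: str
    tools_called: list

@dataclass
class EvalRow:
    source_trace_id: str
    input: str
    expected_properties: dict   # what a correct run should satisfy
    note: str                   # the annotator's explanation of the failure

def annotate(trace, expected_properties, note):
    """Turn a failed production trace into an eval row; the bad output is not copied."""
    return EvalRow(trace.trace_id, trace.user_input, expected_properties, note)

def append_row(dataset, row):
    """Add the row unless this trace was already annotated."""
    if any(existing.source_trace_id == row.source_trace_id for existing in dataset):
        return False
    dataset.append(row)
    return True

dataset = []
failed = Trace("t-001", "Where's my order? It's been 3 weeks", "Orders arrive in 24h.", [])
row = annotate(failed, {"correct_tool": "get_order", "no_invented_policy": True},
               "Model answered from memory instead of looking up the order.")
added_first = append_row(dataset, row)
added_again = append_row(dataset, row)
```

The row keeps the input and the properties a good answer must have. The next offline run then exercises the exact situation that failed in production.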


08 · “Offline eval” is a data term

Machine learning borrowed the word offline. In agent systems it means you are scoring against a curated dataset, not against live production traffic. The job still calls model APIs over the network. The meaningful axis is which data the scorer sees, not whether your laptop has Wi-Fi.

Offline eval gives you a pre-deploy gate on situations you chose ahead of time. You can rerun it and compare runs. Online eval samples real traffic after deploy. There is usually no single expected string; you score a subset of what users actually sent. Online is where you discover new failure modes. Offline is where you pin them so they cannot return unnoticed.
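The two modes can be sketched side by side. The gate rule (block on any case that passed before and fails now) and the sampling rate are assumptions; teams pick their own thresholds.

```python
import random

def offline_gate(baseline, candidate):
    """Pre-deploy gate: compare two runs over the same curated cases.

    Returns (ok, regressions), where regressions are cases that passed in the
    baseline run and fail (or are missing) in the candidate run.
    """
    regressions = sorted(
        case for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )
    return len(regressions) == 0, regressions

def sample_online(trace_ids, rate, seed):
    """Post-deploy: pick a subset of real traffic to score; no expected strings exist."""
    rng = random.Random(seed)
    return [t for t in trace_ids if rng.random() < rate]

baseline = {"refund": True, "late_order": True, "angry_customer": False}
candidate = {"refund": True, "late_order": False, "angry_customer": True}
ok, regressions = offline_gate(baseline, candidate)

traffic = [f"t-{i}" for i in range(1000)]
to_score = sample_online(traffic, rate=0.05, seed=7)
```

The candidate fixed one case and broke another. A gate on the overall pass rate would see no change, which is why the comparison is per case.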

09 · Traces and OpenTelemetry

When an agent fails in production, you need the full trajectory: which tools fired, in what order, with what prompts and token counts. HTTP access logs only tell you that a request completed in 1.3 seconds with a 200. That is an envelope. OpenTelemetry with GenAI semantic conventions is closer to the letter inside.

OpenTelemetry splits into three layers. The data model defines a span as a unit of work with timing, status, parent-child links, and attributes. That is where tool calls, prompts, and tokens live. Instrumentation is the SDK code that creates spans in your application. OTLP is the wire format that ships spans; it is transport, not the model.
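To make the data-model layer concrete, here is a plain-Python picture of a span tree for one agent run. This is not the OpenTelemetry SDK; `Span` is a hypothetical dataclass. The `gen_ai.*` attribute keys follow the GenAI semantic conventions as we understand them, so check the current spec before relying on exact names; the model name is a placeholder.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """A unit of work: timing, status, a parent link, and attributes."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    status: str = "UNSET"
    attributes: dict = field(default_factory=dict)

    def end(self, status: str = "OK") -> None:
        self.end_ns = time.time_ns()
        self.status = status

trace_id = uuid.uuid4().hex
root = Span("invoke_agent support_bot", trace_id)

# Child span for the model call: prompts and token counts live in attributes.
llm = Span("chat example-model", trace_id, parent_id=root.span_id, attributes={
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 412,
    "gen_ai.usage.output_tokens": 37,
})
llm.end()

# Child span for the tool call the model requested.
tool = Span("execute_tool get_weather", trace_id, parent_id=root.span_id, attributes={
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": "get_weather",
})
tool.end()
root.end()

spans = [root, llm, tool]
```

Instrumentation is whatever code creates these objects inside the application, and OTLP is how a list like `spans` would travel to a backend. Neither changes what a span is.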

Confusing HTTP logging with OTel is common because both “go over the network.” They answer different questions.

10 · Read the vendor hello world

The ecosystem is easier to navigate once you notice that each vendor’s first tutorial advertises the layer it owns.

LangGraph’s hello world is a calculator agent: a graph with a loop-back edge. It demonstrates orchestration. It may emit span events, but it does not ship the dashboard.

An eval product’s hello world is a scorer over a small dataset, sometimes without an LLM at all. It demonstrates measurement of outputs against examples.

An observability product’s hello world is a trivial run whose span tree renders in a UI. It demonstrates ingestion and display.

Those layers stack. An orchestration framework does not replace an observability backend, and neither replaces an eval platform.

11 · LangSmith, Langfuse, Braintrust

LangSmith, Langfuse, and Braintrust overlap heavily: tracing, evals, prompt storage, datasets, human annotation. Organizations standardize on one; running all three is rare. The choice tends to follow constraints rather than feature checklists.

LangSmith is the default when you are already on LangChain or LangGraph, because it instruments that runtime and exposes graph structure in traces. Langfuse fits teams that need open-source self-hosting for data residency or cost control; ClickHouse acquired Langfuse in early 2026. Braintrust fits teams that want deploy gates driven by eval scores, similar in spirit to CI test suites.

A separate axis is AI-native platforms (the three above, plus Arize, Galileo, Weave) versus general APM vendors (Datadog, New Relic, SigNoz) extending existing products with LLM span types. One camp models agent trajectories as first-class objects. The other embeds them in conventional request tracing.

12 · Most teams buy; they do not build this stack

The eval and observability stack in these notes serves teams that build agents in-house. That is a minority path in customer support and similar markets. The first fork is build versus buy, and ticket volume plus control requirements pick the default.

Below roughly two thousand resolutions per month, consumption products such as Intercom Fin or Ada are common; pricing is often near a dollar per resolution and setup is measured in hours. In the SMB band, Lindy sometimes pencils out when total support spend is under roughly half a million dollars per year. Enterprise deployments that need agents to take real actions in core systems often land on Sierra or Decagon, with six-figure contracts and per-resolution pricing in the two-to-five dollar range. Teams already on Zendesk or Salesforce frequently adopt Agentforce or Zendesk AI because the integration path is shortest.

The build path appears when volume is high or control requirements are hard. At roughly thirty thousand resolutions per month, building can cost about twelve to fifteen percent of what buying would, but you inherit operational ownership. LangSmith-class tooling shows up there as one component among many; it is not what a team gets when it purchases Fin.

Closing definition

Agent engineering, stated in one place: engineer context and control flow as a state machine around a stochastic core, then measure the system against an input distribution you cannot fully enumerate in advance.

You need both halves. A harness without an eval loop ships blind. An eval loop without a harness has nothing stable to measure. Production work is learning both, usually by breaking both first.