Early career notes here. I’ve been playing with agent observability systems: framework quickstarts, lab demos, and production traces. If you’ve built a basic weather bot and want to know what breaks next, here is the map.
From weather bot to agent
The ecosystem has two hello worlds. The LLM version is a single chat completion: send "Hello", print the reply, and your API works. The agent version is the classic weather example: define a get_weather() tool, ask “what is the weather in SF?”, and watch the model emit a tool call.
But that weather example only proves your harness is wired. A real agent needs a loop it actually controls across multiple steps. It must call a tool, read the result, and choose whether to keep going or stop. That middle continue/stop decision is where the real action happens.
Real systems add planning, memory, and serious judgment around that loop. Take a task like “what should I pack?” It requires real agency: the agent has to read your calendar to find the destination, call the weather API for that destination, and then answer. You can’t even specify step two until step one comes back.
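Here is a minimal sketch of that loop for the packing example. Everything in it is an assumption for illustration: `read_calendar`, the canned return values, and `scripted_model`, which stands in for a real model call so the example runs offline.

```python
from typing import Callable

# Hypothetical tools; real ones would call a calendar and a weather API.
def read_calendar() -> str:
    return "Denver"  # destination of the next trip (canned value)

def get_weather(city: str) -> str:
    return f"snow in {city}"  # canned forecast

TOOLS: dict[str, Callable[..., str]] = {
    "read_calendar": read_calendar,
    "get_weather": get_weather,
}

def scripted_model(history: list[dict]) -> dict:
    """Stand-in for the model: picks the next action from what it has seen."""
    results = {h["tool"]: h["result"] for h in history}
    if "read_calendar" not in results:
        return {"action": "call", "tool": "read_calendar", "args": {}}
    if "get_weather" not in results:
        # Step two cannot be written until step one has returned a city.
        return {"action": "call", "tool": "get_weather",
                "args": {"city": results["read_calendar"]}}
    return {"action": "stop", "answer": f"Pack for {results['get_weather']}."}

def run_agent(model, max_steps: int = 5) -> str:
    history: list[dict] = []
    for _ in range(max_steps):
        decision = model(history)
        if decision["action"] == "stop":  # the explicit stop decision
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])
        history.append({"tool": decision["tool"], "result": result})
    raise RuntimeError("step budget exhausted without a stop decision")

print(run_agent(scripted_model))  # Pack for snow in Denver.
```

The `max_steps` budget is a design choice, not a requirement: it keeps a model that never chooses to stop from looping forever.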
Exercise (classification): tool call or agent loop? Criterion: an explicit continue/stop decision between steps.
Long horizons need a harness
Current models can chain dozens of tool calls and run for hours. Frontier models are becoming more agentic because they’re trained with reinforcement learning to make more tool calls, and more tool calls put better context in front of the model. But in production, the real ceiling is keeping the run coherent over a long horizon.
Each step can look acceptable in isolation while the whole sequence fails. If steps fail independently, end-to-end success is per-step reliability raised to the number of steps. At 95% per-step reliability over twenty steps, the full run succeeds only about 36% of the time, which is worse than a coin flip. That’s why polished demos break under real traffic. Individual steps pass, and then the run just drifts away.
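A quick check of that arithmetic, as a sketch. It assumes every step succeeds or fails independently with the same probability, which real runs only approximate.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps

for p in (0.95, 0.99, 0.999):
    print(f"{p:.3f} per step over 20 steps -> {end_to_end_success(p, 20):.1%}")
# 0.950 per step over 20 steps -> 35.8%
# 0.990 per step over 20 steps -> 81.8%
# 0.999 per step over 20 steps -> 98.0%
```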
Exercise (compounding reliability): end-to-end success = (per-step reliability)^n, where n is the number of steps.
Once you see that curve, checkpointing, partial recovery, and resume-after-failure become mandatory infrastructure. A larger model might bump up per-step reliability, but that exponent isn’t going anywhere. Put bluntly: you are building a state machine with a stochastic transition function right in the middle.
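To make resume-after-failure concrete, here is a small sketch that saves progress to a JSON file after each step. The step names, the file layout, and the `flaky` step are all hypothetical, and a production version would also want atomic writes and idempotent steps.

```python
import json
import os
import tempfile

def run_with_checkpoints(steps, path):
    """Run (name, function) steps in order, saving progress after each one.

    Each function takes the state dict and returns the updated state.
    On restart, steps already recorded in the checkpoint are skipped.
    """
    done, state = [], {}
    if os.path.exists(path):
        with open(path) as f:
            saved = json.load(f)
        done, state = saved["done"], saved["state"]
    for name, fn in steps:
        if name in done:
            continue
        state = fn(state)
        done.append(name)
        with open(path, "w") as f:
            json.dump({"done": done, "state": state}, f)
    return state

calls = {"n": 0}

def flaky(state):
    """Fails on its first call, succeeds afterwards."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return {**state, "b": 2}

steps = [
    ("a", lambda s: {**s, "a": 1}),
    ("b", flaky),
    ("c", lambda s: {**s, "c": 3}),
]
path = os.path.join(tempfile.mkdtemp(), "run.json")

try:
    run_with_checkpoints(steps, path)
except RuntimeError:
    pass  # the first attempt dies at step "b"; step "a" is already saved

print(run_with_checkpoints(steps, path))  # resumes at "b"
# {'a': 1, 'b': 2, 'c': 3}
```

The second call never re-runs step "a", which is the whole point: a failure at step fifteen should not cost you steps one through fourteen.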
The stochastic part
The harness itself is just pure, deterministic code. Tools run the same way every time, context assembly is a basic function, and transitions are edges you defined. But then the run hits the model, and it returns text that was impossible to derive in advance.
People use “stochastic” for two different problems, and mixing them up wastes time. The narrow sense is sampling variance, controlled by temperature and top-p. You can set temperature to zero, but you’ll still miss bit-identical runs, because batching, floating-point operation ordering, mixture-of-experts routing, and silent model revisions all sit underneath the exact same model name.
The broader sense is simply underspecification. There is no closed-form spec for the next customer message the way there is for parseInt on a new string. Fixing your sampling leaves that layer completely untouched.
Exercise (resampling): fix the prompt “Where’s my order?” and resample it to observe the variance.
Don’t debug the oracle. Constrain what reaches it, log what comes back, and measure the output distributions.
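A sketch of what measuring the distribution can look like. `fake_model`, the reply categories, and their weights are invented so the example runs without an API; with a real model you would call it repeatedly and bucket the replies yourself.

```python
import random
from collections import Counter

# Hypothetical reply categories standing in for real model outputs.
REPLIES = ["asks_for_order_id", "looks_up_order", "apologizes_only"]
WEIGHTS = [0.6, 0.3, 0.1]

def fake_model(prompt: str, rng: random.Random) -> str:
    """Stand-in for one sampled model call; returns a reply category."""
    return rng.choices(REPLIES, weights=WEIGHTS, k=1)[0]

def sample_distribution(prompt: str, n: int, seed: int = 0) -> Counter:
    rng = random.Random(seed)
    log = [fake_model(prompt, rng) for _ in range(n)]  # log what comes back
    return Counter(log)

counts = sample_distribution("Where's my order?", n=200)
for reply, count in counts.most_common():
    print(f"{reply}: {count / 200:.0%}")
```

The useful output is the shape of the histogram, not any single reply. A category that shows up 10% of the time is a production behavior, whether or not you saw it in the demo.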
Evals and the feedback loop
Classical tests just assert output === expected. But agent runs break that pattern: inputs are open-world, and outputs have tons of valid forms. There are a million perfectly acceptable replies to “where is my order?” Instead, a scorer has to check properties like grounding, tone, correct tool use, and whether the model invented a policy that doesn’t exist.
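As a rough sketch, a property scorer for the order example might look like this. The tool name `lookup_order`, the `facts` shape, and the no-refund policy are assumptions, and the substring checks are deliberately crude. Tone is left out because it usually needs a model-graded or human scorer.

```python
def score_reply(reply: str, tool_calls: list[str], facts: dict[str, str]) -> dict[str, bool]:
    """Check properties of a reply instead of comparing to one expected string."""
    text = reply.lower()
    return {
        # Grounding: the status quoted must be the one the tool returned.
        "grounded": facts["status"].lower() in text,
        # Tool use: the order lookup must have been called.
        "used_lookup": "lookup_order" in tool_calls,
        # Policy: no refund promise, which this hypothetical policy forbids.
        "no_invented_policy": "refund" not in text,
    }

facts = {"status": "shipped"}
good = score_reply("Your order has shipped and arrives Friday.", ["lookup_order"], facts)
bad = score_reply("Sorry! We'll refund you in full.", [], facts)
print(good)  # every property True
print(bad)   # every property False
```

Many differently worded replies pass the same three checks, which is exactly what an equality assertion cannot express.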
Think of your prompt as versioned application code, and treat your eval dataset as your live input distribution. A useful dataset favors totally different situations over twenty-five paraphrases of the same situation. Weight it toward the tail: that’s where production systems go to die.
LLM systems add a useful feedback edge: a failed production trace gets annotated and becomes a new row in your eval dataset. Production failures feed the lab, and LLMOps platforms are trying hard to make that loop cheap.
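That feedback edge can be sketched in a few lines. The `Trace` fields and the row layout here are hypothetical; each platform has its own schema for traces and dataset rows.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    trace_id: str
    user_input: str
    output: str
    annotation: str = ""   # reviewer's note on what went wrong
    failed: bool = False

@dataclass
class EvalDataset:
    rows: list[dict] = field(default_factory=list)

    def add_from_trace(self, trace: Trace) -> bool:
        """Add an annotated failure as a regression row; skip duplicates."""
        if not (trace.failed and trace.annotation):
            return False
        if any(r["source_trace"] == trace.trace_id for r in self.rows):
            return False
        self.rows.append({
            "input": trace.user_input,
            "bad_output": trace.output,
            "expectation": trace.annotation,
            "source_trace": trace.trace_id,
        })
        return True

dataset = EvalDataset()
failure = Trace("t-1", "where is my order?", "We'll refund you in full.",
                annotation="must not promise refunds", failed=True)
print(dataset.add_from_trace(failure))  # True: new regression row
print(dataset.add_from_trace(failure))  # False: already captured
```

Keeping `source_trace` on the row lets you walk back from a failing eval to the production run that motivated it.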
Offline eval scores your curated examples ahead of time as a pre-deploy gate. Online eval samples real, live production traffic after you deploy. Online finds the brand-new failure modes, and offline pins them down so they can never return unnoticed.
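The two modes reduce to two small functions in this sketch. The 0.9 threshold, the 5% sampling rate, and the stub `agent` and `scorer` are placeholders I picked for illustration, not recommended values.

```python
import random

def offline_gate(dataset, agent, scorer, threshold=0.9) -> bool:
    """Pre-deploy: score every curated row; block the deploy below threshold."""
    passed = sum(scorer(row, agent(row["input"])) for row in dataset)
    return passed / len(dataset) >= threshold

def online_sample(traces, rate, rng):
    """Post-deploy: pick a random fraction of live traces for scoring."""
    return [t for t in traces if rng.random() < rate]

eval_rows = [
    {"input": "where is my order?", "must_include": "order"},
    {"input": "cancel my plan", "must_include": "cancel"},
]

def agent(text: str) -> str:
    return f"Happy to help with: {text}"  # stub agent

def scorer(row: dict, output: str) -> bool:
    return row["must_include"] in output

print(offline_gate(eval_rows, agent, scorer))  # True: both rows pass

live = [f"trace-{i}" for i in range(1000)]
sampled = online_sample(live, rate=0.05, rng=random.Random(0))
print(f"{len(sampled)} of {len(live)} live traces sampled for scoring")
```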
Traces and vendors
When an agent fails in production, you need the full trajectory. Which tools fired? In what order? With what prompts and token counts? Standard HTTP access logs just tell you a request completed in 1.3 seconds with a 200 OK. That’s just the envelope. OpenTelemetry with GenAI semantic conventions is how you read the actual letter inside.
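To show the shape of that letter, here is a tiny in-memory span recorder. It is a stand-in I wrote for illustration, not the OpenTelemetry SDK. The attribute and span names follow my reading of the GenAI semantic conventions, which are still evolving, so check the current spec before relying on them. The model name and token counts are made up.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Tiny in-memory span recorder; a stand-in for a real tracing SDK."""

    def __init__(self):
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name, attributes=None):
        record = {"name": name, "depth": len(self._stack),
                  "attributes": dict(attributes or {}),
                  "start": time.perf_counter()}
        self.spans.append(record)       # recorded in start order
        self._stack.append(record)
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - record["start"]
            self._stack.pop()

tracer = Tracer()
with tracer.span("invoke_agent support-bot"):
    with tracer.span("chat example-model",
                     {"gen_ai.request.model": "example-model"}) as s:
        # Token counts would come from the model response; these are made up.
        s["attributes"]["gen_ai.usage.input_tokens"] = 412
        s["attributes"]["gen_ai.usage.output_tokens"] = 37
    with tracer.span("execute_tool lookup_order",
                     {"gen_ai.tool.name": "lookup_order"}):
        pass  # the tool call itself would run here

for s in tracer.spans:
    print("  " * s["depth"] + s["name"], s["attributes"])
```

The indented printout is the trajectory: which steps ran, in what order, nested under which parent, with what token counts. The access log records none of that.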
Vendor hello worlds are basically positioning documents. LangGraph’s hello world is a calculator agent: a graph with a loop-back edge to show off orchestration. An eval product’s hello world is a scorer over a tiny dataset to demonstrate measurement. An observability product’s hello world is just a trivial run whose span tree renders clearly in a UI.
Those layers stack up, and platforms like LangSmith, Langfuse, and Braintrust overlap heavily. LangSmith is the default if you’re already on LangChain or LangGraph because it instruments that runtime well. Langfuse is strong for teams that want open-source self-hosting for cost control, and ClickHouse acquired them in early 2026. Braintrust fits if you want deploy gates driven by eval scores right in your CI test suites.
Build or buy
All this eval and observability tooling is useful for teams building agents in-house, but that’s a minority path in markets like customer support. Your ticket volume usually makes the choice for you.
If you are doing under two thousand resolutions per month, consumption products like Intercom Fin or Ada are a no-brainer: setup takes a few hours and costs about a dollar per resolution. In the SMB band, Lindy pencils out great if total support spend is under half a million dollars a year. Enterprise deployments landing on Sierra or Decagon bring six-figure contracts and pricing in the two-to-five dollar range. If you’re already on Zendesk or Salesforce, you’ll probably just adopt Agentforce or Zendesk AI because the integration path is so short.
When does the build path appear? Around thirty thousand resolutions per month, building your own agent can land near twelve to fifteen percent of the cost of buying one. The catch is that you inherit total operational ownership. That’s where LangSmith-class tooling shows up as one piece of a much larger stack.
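A back-of-envelope version of that comparison, as a sketch. The one-dollar price per resolution is an assumption borrowed from the consumption band above, and the arithmetic leaves out the engineering time that operational ownership implies.

```python
def monthly_cost(resolutions: int, price_per_resolution: float) -> float:
    return resolutions * price_per_resolution

def build_cost_range(buy_cost: float, low: float = 0.12, high: float = 0.15):
    """Apply the twelve-to-fifteen percent figure to a monthly buy cost."""
    return buy_cost * low, buy_cost * high

buy = monthly_cost(30_000, 1.00)  # assumed $1 per resolution
low, high = build_cost_range(buy)
print(f"buy: ${buy:,.0f}/month, build: ${low:,.0f} to ${high:,.0f}/month")
# buy: $30,000/month, build: $3,600 to $4,500/month
```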
Closing definition
If I had to define agent engineering in one sentence: you engineer context and control flow as a state machine around a stochastic core, and then you measure it against an input distribution you can’t fully enumerate. Most teams learn that the hard way: by breaking both of them first.