2026-06-01 · Chuan Liu
On the implementation details that make or break a small agent runtime
agentcon is about 1000 lines of Python. Its design fits on a napkin. But six implementation details — most of them invisible from the README — decide whether it actually survives production.
A few weeks ago I wrote about agentcon, a small agent runtime built on a single idea: the main conversation and background tasks share one inbox.
That post was about the abstraction. This one is about what almost killed it — six implementation details that look small from the outside, but each took at least one production incident to take seriously.
Most framework writeups skip this layer. I think that's a mistake. The abstraction is the part that's easy to copy. The details are where the design either survives or doesn't.
—
One. Return implies result.
The most natural way for a developer to write a background worker is the shortest one: do something asynchronous, return a value, and stop. No ceremony.
But my first version of the framework didn't handle that. If the worker didn't explicitly emit a RESULT event before returning, the main conversation would never wake up. The task finished. The value was sitting there. But the inbox never got the event, so the main loop kept waiting forever.
I fixed it by having the runner check the worker's status after it returned, and synthesize a RESULT event if the user forgot. It's a few lines of code. But it changed the mental contract: you can write the worker as short as you want, and the framework will do the right thing.
The best test of an abstraction: how short can someone get away with using it before something breaks? If the answer is shorter than they expected, you've designed well. If the answer is longer, you've leaked an obligation onto them.
—
Two. close() must wake the main loop.
The main loop blocks on inbox.get(). That's the whole point of the architecture — workers push events; the loop consumes them. So when a session needs to shut down, I did the obvious thing: set a closed flag, cancel all worker tasks, and return.
Tested it. The session never actually shut down.
The main loop was sleeping inside inbox.get(). It doesn't poll the closed flag. It can't see the flag change. A flag that nobody reads is just memory.
The fix is two lines: have close() push a sentinel message into the inbox, and have the loop check the closed flag immediately after each get returns. The sentinel wakes the loop; the flag check ends the loop.
This kind of bug doesn't exist in synchronous code. You can't have a sleeping consumer that misses a state change because there's no sleeping in the first place. Async makes some problems easier and creates a whole new class of bugs where the wake-up semantics matter as much as the logic.
—
Three. History trimming must understand the protocol.
OpenAI's tool-calling protocol requires that a message with role=tool be immediately preceded by an assistant message with tool_calls. The two pieces are bound together. Break that pairing and the API returns 400.
Naive history truncation — keep the last N messages — does exactly that. It can chop the assistant message and leave its tool messages stranded at the new beginning of the history. The next request fails with a cryptic error.
This bug never shows up in demos. It only appears in production, in long conversations, intermittently. Debugging it is painful because the symptom is at the API boundary, but the cause is two layers up in your history management.
The fix is protocol-aware truncation: before deleting messages, advance the cut point until the new first message is something the API will accept as a starting point.
Most AI engineering is actually protocol engineering. The LLM is the easy part.
—
Four. Sub-agent history must be isolated.
When you let an agent spawn a sub-agent that runs its own LLM loop, the most important architectural decision is what NOT to share. In agentcon, sub-agents have private message lists. Their intermediate reasoning, tool calls, and retries never touch the main conversation. Only the final result gets pushed back as a RESULT payload.
This sounds obvious in retrospect. It is not what frameworks do by default.
The temptation is to share everything — let the parent agent see what the sub-agent is thinking, so it can supervise. Three things go wrong if you give in.
Context window: a four-iteration sub-agent burns 5000+ tokens of intermediate state. Five sub-agents burn 25K. Your main conversation drowns in working memory that wasn't yours.
Coherence: the parent shouldn't reason over the sub-agent's internal monologue. It should reason over the sub-agent's conclusion. These are different things, and mixing them makes the parent worse at both.
Composition: once sub-agents have private state, they can recursively spawn their own sub-agents without polluting anyone else's context. Without isolation, recursion is impossible — the context explodes after one level.
The general principle, transferable far beyond agents: when you spawn a process, give it its own scratchpad. Share the conclusion, not the work.
—
Five. Single-exit write lock.
Multiple background workers can call emit concurrently. Each emit eventually writes to the transport — WebSocket, stdio, or whatever the application uses to talk to the user.
My first version was a one-liner: call send_raw, return. It looked fine. It wasn't.
Two workers emitting at the same time can interleave their frames. The client receives garbled half-messages. Worse, if send_raw raises — network hiccup, client disconnected — the worker crashes. "Failure to deliver a UI event" should never kill a background task.
The fix is six lines: an asyncio.Lock to serialize, a try block to absorb transport death, a no-op when there's no transport.
The principle: at any place where N producers fan into one external resource, you need to serialize and you need to swallow the resource dying. Both. Not one or the other. This is not novel, but it's the kind of thing you forget when you're focused on the interesting parts of the system.
The unsexy parts of a framework are where it earns trust.
—
Six. Tracing is a separate channel.
There are two kinds of events the system emits.
One kind goes to the LLM: the inbox events. They are asynchronous, content-rich, and they wake up turns. They're the agent's working memory.
Another kind goes to observability: traces of LLM calls, tool calls, sub-agent lifecycle. They are synchronous, structure-rich, and they go to logs — Logfire, OpenTelemetry, JSONL, whatever the operator wants.
Most frameworks blend these into a single event stream and then add filters to keep traces out of the LLM's view. This is backwards. The two channels have different consumers, different schemas, different criticality, and different failure modes.
In agentcon they're two separate APIs sharing zero code. The session passes both to whoever needs them. Done.
The general lesson: if you find yourself adding filters to keep things out of a stream, you probably have two streams pretending to be one.
—
Looking back at the list, none of these is a clever algorithm. They're all the same kind of insight: a specific protocol obligation, a specific failure mode, a specific concurrency hazard. The kind of thing you only know after you've shipped.
I think this is what production AI engineering actually is, in 2026. The interesting part isn't the model. The interesting part is the dozens of small protocol-level decisions that decide whether your system survives a real user for a real week.
If that resonates — and especially if you've solved any of these differently — I'd love to hear about it. The repo is open and the issues are open: github.com/Rumonthebeat/agentcon