What Happened When I Tried to Coordinate Two AI Agents Over NFS

I wanted two independent agent instances (different machines, different runtime environments) to stay in sync on shared intent. The workflow was not real-time chat. It was slower, operational collaboration:

  • keep sessions local
  • exchange actionable notes
  • share a durable profile for the user/operator context
  • sync skills and preferences across both
  • avoid building a full message platform

I wanted the lowest possible surface area.

Yeah, if you follow AI tooling, you might be thinking: why not ACP? Because ACP is a local editor-to-agent transport; it does not solve cross-machine coordination or durable shared state.

My first attempt was the obvious one: a shared filesystem bus.

bus/
  inbox-a/
  inbox-b/
  archive-a/
  archive-b/
  PROTOCOL.txt

The protocol was intentionally minimal.

  • one message file per event
  • sender writes into peer inbox
  • unread = in inbox
  • processed = moved to archive
  • messages are JSON with type, IDs, and payload
  • runtime state remains local to each node

That looked robust enough for a first pass.
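
As a concrete sketch of that format (everything beyond type, IDs, and payload is my illustration, not a fixed schema), a message file looked something like:

{
  "id": "a-0007",
  "type": "note",
  "from": "agent-a",
  "to": "agent-b",
  "payload": {
    "text": "Updated the deploy checklist; pull skills when convenient."
  }
}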

Why the Shared Mount Failed in Practice

It failed in a way that was hard to debug, because it did not explode quickly.

The shared storage was remote and mounted. From a shell prompt it looked like a normal directory, but it was not a local filesystem in the consistency sense: directory listings were cached.

I saw cases like:

  • a write occurred on machine A
  • machine B’s listing still showed an empty inbox
  • checking the remote directly showed the file existed
  • clearing the cache did not immediately fix visibility

The mount configuration exposed a dir-cache-time of about 30 minutes.

At a 12-hour communication cadence, that cache window felt long but manageable on paper. In reality, it was exactly why the “shared state” was not trustworthy enough.
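
A minimal probe, sketched in Python, captures the failure signature: a directory listing and a direct path check disagreeing about the same file. (Whether a direct check bypasses the cache depends on the mount; this just mirrors what I observed.)

import os

def probe(inbox: str, filename: str) -> None:
    # A stale directory cache can omit a file that a direct path check confirms.
    listed = filename in os.listdir(inbox)
    exists = os.path.exists(os.path.join(inbox, filename))
    print(f"listed={listed} exists={exists}")
    if exists and not listed:
        print("stale listing: file is real but invisible to readdir")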

The hardest part was not transporting bytes. The hardest part was confidence:

  • Did the recipient actually see the message?
  • If not, is the sender still waiting?
  • Is the listing stale, or did the write fail?

A bus that can lie with delay is worse than no bus at all.

I also learned that reliability here is not about speed but about meaning: every stale read introduces a second-order failure where humans spend effort reconciling state that should have been deterministic.

That overhead kills the value of automation. I wanted a system where “does it exist?” and “can I act on it?” become the same question.

Why Not “Just Use a Real NFS Backend?”

The obvious fallback was to replace the object-backed mount with a heavier file-sharing setup:

  • EFS / NFS share
  • private network path
  • Tailscale/VPN rules
  • mount lifecycle management

That would probably improve consistency, but it would also move the problem to infra babysitting. For a low-frequency collaboration loop, that was the wrong trade.

I needed correctness with less operational overhead.

The Design Change: Git as the Coordination Layer

I switched from “live mount as truth” to “explicitly synced artifacts as truth”.

The new model was:

  • one shared repo for durable bus state
  • explicit git pull before read
  • explicit git commit && push after write
  • message lifecycle in files, but lifecycle visibility is commit-based
  • explicit job boundaries so no local process assumes immediate remote truth

The protocol became deterministic:

  1. git pull
  2. read inbox
  3. move handled message to archive
  4. git add -A && git commit && git push
  5. write outbound messages
  6. git add -A && git commit && git push

No implicit state. No unverified visibility assumptions.
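
A sketch of one cycle for node B, assuming Python, a clone at a hypothetical path, and a handle() hook for whatever the agent does with a message:

import shutil
import subprocess
from pathlib import Path

BUS = Path("/srv/bus-repo")                      # hypothetical clone location
INBOX, ARCHIVE = BUS / "inbox-b", BUS / "archive-b"

def git(*args: str) -> None:
    subprocess.run(["git", "-C", str(BUS), *args], check=True)

def staged_changes() -> bool:
    # `git diff --cached --quiet` exits 1 when something is staged
    r = subprocess.run(["git", "-C", str(BUS), "diff", "--cached", "--quiet"])
    return r.returncode != 0

def sync_cycle(handle) -> None:
    git("pull", "--ff-only")                            # 1. pull before read
    for msg in sorted(INBOX.glob("*.json")):            # 2. read inbox
        handle(msg.read_text())
        shutil.move(str(msg), str(ARCHIVE / msg.name))  # 3. archive handled message
    git("add", "-A")
    if staged_changes():
        git("commit", "-m", "archive handled messages") # 4. commit and push
        git("push")
    # Steps 5-6: outbound writes follow the same add / commit / push pattern.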

Repo Layout That Stayed Honest

I kept the bus repo intentionally small:

bus-repo/
  inbox-a/
  inbox-b/
  archive-a/
  archive-b/
  user-sync/
    USER.md
  README.md

Messages carried intent rather than code payloads:

  • pull skills
  • ack
  • profile updated
  • heartbeat
  • state transition markers

That prevented the bus from turning into a second replication system.
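
For example, a skill-sync request was a pointer at the skills repo, not the skill content itself (fields illustrative):

{
  "type": "pull-skills",
  "from": "agent-a",
  "to": "agent-b",
  "payload": {
    "ref": "main",
    "reason": "new user-authored skill available"
  }
}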

Separate Planes: Skills vs Messages

I also separated the concerns formally:

  • skills repo → source for reusable procedures and skill definitions
  • bus repo → coordination artifacts + profile state

Keeping them separate avoided cross-contamination and gave each repo a clear failure model.

USER.md Merge Was a Good Example of Semantic State

USER.md was not just text; it represented preferences and operator memory. Line-based merging was the wrong abstraction.

For this file, I moved to an in-model semantic merge:

  1. pull shared USER.md
  2. read local USER.md
  3. combine durable facts
  4. de-duplicate and keep signal over noise
  5. write back canonical merged version
  6. commit if changed

If both agents had written in conflicting styles, a human could still read what each version represented. The system then resolved the conflict by semantic intent, not by blind textual merge.
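
A sketch of that loop, with merge_semantically() standing in for the in-model step and hypothetical paths:

import subprocess
from pathlib import Path

BUS = Path("/srv/bus-repo")                    # hypothetical clone location
SHARED = BUS / "user-sync" / "USER.md"
LOCAL = Path.home() / ".agent" / "USER.md"     # hypothetical local profile

def merge_semantically(shared: str, local: str) -> str:
    # Placeholder for the in-model merge: combine durable facts,
    # de-duplicate, keep signal over noise.
    raise NotImplementedError

def sync_profile() -> None:
    subprocess.run(["git", "-C", str(BUS), "pull", "--ff-only"], check=True)
    merged = merge_semantically(SHARED.read_text(), LOCAL.read_text())
    if merged != SHARED.read_text():           # commit only if changed
        SHARED.write_text(merged)
        for cmd in (["add", str(SHARED)],
                    ["commit", "-m", "semantic merge of USER.md"],
                    ["push"]):
            subprocess.run(["git", "-C", str(BUS), *cmd], check=True)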

Automation: Keep It Predictable

Once transport became explicit, the automation layer became straightforward:

  • scheduled skill sync from local sources
  • filtered push of user-authored skills only
  • periodic bus sync job
  • immediate writes only for important state changes

The loop became boring in the best sense: reliable and low drama.
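
As a sketch, the schedule can be as small as two cron entries (script names and times are hypothetical):

# periodic bus sync job
0 6,18 * * *  /usr/local/bin/bus-sync
# nightly skill sync, user-authored skills only
30 2 * * *    /usr/local/bin/skill-sync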

Why This Won Over Filesystem Sharing

Not because Git is magically fast.

Not because Git is elegant.

Because the failure model became auditable.

With mounts, failures were ambiguous:

  • was the file written?
  • is the listing stale?
  • is the remote copy the real truth?

With Git:

  • did we pull?
  • did we commit?
  • did we push?
  • did the other side fetch?

Each state transition has an explicit artifact and a timestamped point in history. That is what made trust cheap.

I am not arguing this is the best bus for everything. If you truly need low-latency collaborative editing, you still need a different substrate.

But for durable coordination across machines with sparse interaction, this model gives you stronger invariants than most “shared directory” approaches:

  • deterministic visibility
  • fewer hidden assumptions
  • simpler failure triage
  • easier operational recovery

Rule I Keep

For low-frequency agent coordination, explicit synchronization usually beats pretend-real-time shared folders.

If your transport can silently go stale, it is not a transport for control.

Git was slower, but in this case reliability was not a secondary requirement.

It was the product.