What Happened When I Tried to Coordinate Two AI Agents Over NFS

I wanted two independent agent instances (different machines, different runtime environments) to stay in sync on shared intent. The workflow was not real-time chat. It was slower, operational collaboration:

  • keep sessions local
  • exchange actionable notes
  • share a durable profile for the user/operator context
  • sync skills and preferences across both
  • avoid building a full message platform

I wanted the lowest possible surface area.

Yeah, if you follow AI tooling, you might be thinking: why not ACP? Because ACP is a local editor-to-agent transport; it does not solve cross-machine coordination or durable shared state.

My first attempt was the obvious one: a shared filesystem bus.

bus/
  inbox-a/
  inbox-b/
  archive-a/
  archive-b/
  PROTOCOL.txt

The protocol was intentionally minimal.

  • one message file per event
  • sender writes into peer inbox
  • unread = in inbox
  • processed = moved to archive
  • messages are JSON with type, IDs, and payload
  • runtime state remains local to each node

That looked robust enough for a first pass.
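
As a concrete sketch of that format (everything beyond type, IDs, and payload is my illustration, not a fixed schema), a message file looked something like:

{
  "id": "a-0007",
  "type": "note",
  "from": "agent-a",
  "to": "agent-b",
  "payload": {
    "text": "Updated the deploy checklist; pull skills when convenient."
  }
}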

Why the Shared Mount Failed in Practice

It failed in a way that was hard to debug, because it did not explode quickly.

The shared storage was remote and mounted. From a shell prompt it looked like a normal directory, but it was not a local filesystem in the consistency sense: directory listings were cached.

I saw cases like:

  • a write occurred on machine A
  • machine B’s listing still showed an empty inbox
  • checking the remote directly showed the file existed
  • clearing the cache did not immediately fix visibility

The mount configuration exposed a dir-cache-time of about 30 minutes.

At a 12-hour communication cadence, that cache window felt long but manageable on paper. In reality, it was exactly why the “shared state” was not trustworthy enough.
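
A minimal probe, sketched in Python, captures the failure signature: a directory listing and a direct path check disagreeing about the same file. (Whether a direct check bypasses the cache depends on the mount; this just mirrors what I observed.)

import os

def probe(inbox: str, filename: str) -> None:
    # A stale directory cache can omit a file that a direct path check confirms.
    listed = filename in os.listdir(inbox)
    exists = os.path.exists(os.path.join(inbox, filename))
    print(f"listed={listed} exists={exists}")
    if exists and not listed:
        print("stale listing: file is real but invisible to readdir")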

The hardest part was not transporting bytes. The hardest part was confidence:

  • Did the recipient actually see the message?
  • If not, is the sender still waiting?
  • Is the listing stale, or did the write fail?

A bus that can lie with delay is worse than no bus at all.

I also learned that reliability here is not about speed but about meaning: every stale read introduces a second-order failure where humans spend effort reconciling state that should have been deterministic.

That overhead kills the value of automation. I wanted a system where “does it exist?” and “can I act on it?” become the same question.

Why Not “Just Use a Real NFS Backend?”

The obvious fallback was to replace the object-backed mount with a heavier file-sharing setup:

  • EFS / NFS share
  • private network path
  • Tailscale/VPN rules
  • mount lifecycle management

That would probably improve consistency, but it would also move the problem to infra babysitting. For a low-frequency collaboration loop, that was the wrong trade.

I needed correctness with less operational overhead.

The Design Change: Git as the Coordination Layer

I switched from “live mount as truth” to “explicitly synced artifacts as truth”.

The new model was:

  • one shared repo for durable bus state
  • explicit git pull before read
  • explicit git commit && push after write
  • message lifecycle in files, but lifecycle visibility is commit-based
  • explicit job boundaries so no local process assumes immediate remote truth

The protocol became deterministic:

  1. git pull
  2. read inbox
  3. move handled message to archive
  4. git add -A && git commit && git push
  5. write outbound messages
  6. git add -A && git commit && git push

No implicit state. No unverified visibility assumptions.
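
A sketch of one cycle for node B, assuming Python, a clone at a hypothetical path, and a handle() hook for whatever the agent does with a message:

import shutil
import subprocess
from pathlib import Path

BUS = Path("/srv/bus-repo")                      # hypothetical clone location
INBOX, ARCHIVE = BUS / "inbox-b", BUS / "archive-b"

def git(*args: str) -> None:
    subprocess.run(["git", "-C", str(BUS), *args], check=True)

def staged_changes() -> bool:
    # `git diff --cached --quiet` exits 1 when something is staged
    r = subprocess.run(["git", "-C", str(BUS), "diff", "--cached", "--quiet"])
    return r.returncode != 0

def sync_cycle(handle) -> None:
    git("pull", "--ff-only")                            # 1. pull before read
    for msg in sorted(INBOX.glob("*.json")):            # 2. read inbox
        handle(msg.read_text())
        shutil.move(str(msg), str(ARCHIVE / msg.name))  # 3. archive handled message
    git("add", "-A")
    if staged_changes():
        git("commit", "-m", "archive handled messages") # 4. commit and push
        git("push")
    # Steps 5-6: outbound writes follow the same add / commit / push pattern.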

Repo Layout That Stayed Honest

I kept the bus repo intentionally small:

bus-repo/
  inbox-a/
  inbox-b/
  archive-a/
  archive-b/
  user-sync/
    USER.md
  README.md

Messages carried intent rather than code payloads:

  • pull skills
  • ack
  • profile updated
  • heartbeat
  • state transition markers

That prevented the bus from turning into a second replication system.
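
For example, a skill-sync request was a pointer at the skills repo, not the skill content itself (fields illustrative):

{
  "type": "pull-skills",
  "from": "agent-a",
  "to": "agent-b",
  "payload": {
    "ref": "main",
    "reason": "new user-authored skill available"
  }
}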

Separate Planes: Skills vs Messages

I also separated the concerns formally:

  • skills repo → source for reusable procedures and skill definitions
  • bus repo → coordination artifacts + profile state

Keeping them separate avoided cross-contamination and gave each repo a clear failure model.

USER.md Merge Was a Good Example of Semantic State

USER.md was not just text; it represented preferences and operator memory. Line-based merging was the wrong abstraction.

For this file, I moved to an in-model semantic merge:

  1. pull shared USER.md
  2. read local USER.md
  3. combine durable facts
  4. de-duplicate and keep signal over noise
  5. write back canonical merged version
  6. commit if changed

If both agents had written in conflicting styles, a human could still read what each version represented. The system then resolved the conflict by semantic intent, not by blind textual merge.
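
A sketch of that loop, with merge_semantically() standing in for the in-model step and hypothetical paths:

import subprocess
from pathlib import Path

BUS = Path("/srv/bus-repo")                    # hypothetical clone location
SHARED = BUS / "user-sync" / "USER.md"
LOCAL = Path.home() / ".agent" / "USER.md"     # hypothetical local profile

def merge_semantically(shared: str, local: str) -> str:
    # Placeholder for the in-model merge: combine durable facts,
    # de-duplicate, keep signal over noise.
    raise NotImplementedError

def sync_profile() -> None:
    subprocess.run(["git", "-C", str(BUS), "pull", "--ff-only"], check=True)
    merged = merge_semantically(SHARED.read_text(), LOCAL.read_text())
    if merged != SHARED.read_text():           # commit only if changed
        SHARED.write_text(merged)
        for cmd in (["add", str(SHARED)],
                    ["commit", "-m", "semantic merge of USER.md"],
                    ["push"]):
            subprocess.run(["git", "-C", str(BUS), *cmd], check=True)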

Automation: Keep It Predictable

Once transport became explicit, the automation layer became straightforward:

  • scheduled skill sync from local sources
  • filtered push of user-authored skills only
  • periodic bus sync job
  • immediate writes only for important state changes

The loop became boring in the best sense: reliable and low drama.
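
As a sketch, the schedule can be as small as two cron entries (script names and times are hypothetical):

# periodic bus sync job
0 6,18 * * *  /usr/local/bin/bus-sync
# nightly skill sync, user-authored skills only
30 2 * * *    /usr/local/bin/skill-sync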

Why This Won Over Filesystem Sharing

Not because Git is magically fast.

Not because Git is elegant.

Because the failure model became auditable.

With mounts, failures were ambiguous:

  • was the file written?
  • is the listing stale?
  • is the remote copy the real truth?

With Git:

  • did we pull?
  • did we commit?
  • did we push?
  • did the other side fetch?

Each state transition has an explicit artifact and a timestamped point in history. That is what made trust cheap.

I am not arguing this is the best bus for everything. If you truly need low-latency collaborative editing, you still need a different substrate.

But for durable coordination across machines with sparse interaction, this model gives you stronger invariants than most “shared directory” approaches:

  • deterministic visibility
  • fewer hidden assumptions
  • simpler failure triage
  • easier operational recovery

Rule I Keep

For low-frequency agent coordination, explicit synchronization usually beats pretend-real-time shared folders.

If your transport can silently go stale, it is not a transport for control.

Git was slower, but in this case reliability was not a secondary requirement.

It was the product.