Skip to content

fix: stop non-client daemon starts from orphaning a running daemon#195

Open
zm2231 wants to merge 2 commits into
openclaw:mainfrom
zm2231:fix/daemon-singleton-bind-guard
Open

fix: stop non-client daemon starts from orphaning a running daemon#195
zm2231 wants to merge 2 commits into
openclaw:mainfrom
zm2231:fix/daemon-singleton-bind-guard

Conversation

@zm2231
Copy link
Copy Markdown
Contributor

@zm2231 zm2231 commented May 31, 2026

Summary

Completes the fix for #191. Its client-side pieces (PID reconciliation and a
spawn mutex in DaemonClient) landed, but the proposed host-side guard ("on
daemon start, if a live daemon already owns the socket, do not start a second
one") did not. Explicit and launchd-driven starts (mcporter daemon start,
daemon start --foreground, detached children) bypass DaemonClient entirely
and call runDaemonHost directly, so each one unlinks and rebinds the socket,
leaving the previous daemon alive but unreachable. #191's own headline repro
(racing daemon start --foreground) still reproduces on current main.

This adds the host-side guard.

What's Included

  • src/daemon/host.ts
    • Before binding, probe whether a live daemon already owns the socket: it must
      answer a status request, report this socket path, and have a live pid. A
      bare connect is not enough, since a hung daemon still has its socket in
      listen(2) and the kernel accepts the connection.
    • Skip binding only when on-disk metadata already matches that live daemon's
      pid and socket, which is the same readiness contract DaemonClient enforces,
      so a skip can never strand a client. Otherwise rebind (replacing a hung,
      dead, foreign, or metadata-orphaned owner and writing correct metadata),
      which is what main does today. Corrupt or missing metadata is treated as
      non-matching.
    • Bind under a dedicated ${metadataPath}.bind lock, separate from the
      client's metadata lock, so a client awaiting readiness while holding that
      lock cannot deadlock the bind.

Repro and results

Minimal manual repro (any keep-alive config at ./mcporter.json):

mcporter --config ./mcporter.json daemon stop
mcporter --config ./mcporter.json daemon start --foreground &
mcporter --config ./mcporter.json daemon start --foreground &
sleep 3
ps -ax | grep -c "[d]aemon start --foreground"   # main: 2, this branch: 1

Reproduced on a second macOS machine against current main. Counts are surviving
daemon processes after the race settles.

Scenario main This branch
#191 repro recipe (race 2x daemon start --foreground) 2 1
daemon start --foreground x5, 5 trials 2 every trial 1 every trial
detached daemon start x5, 3 trials up to 3 1 every trial
implicit cold-start (mcporter list x10), 3 trials 1 1

A frozen daemon (SIGSTOP, socket still listening) is reclaimed by a replacement
in ~2s, and a live daemon whose metadata is missing is also rebound rather than
stranding the caller, so the guard does not break stale or hung-socket recovery.

Tests

  • tests/daemon-host.test.ts: probe (live / hung / foreign-socket / dead-pid /
    no-listener) and metadataMatches (match / mismatch / missing / corrupt).
  • tests/daemon.integration.test.ts: a 4-way daemon start --foreground race
    asserting one survivor (fails on main), and a missing-metadata case asserting
    the replacement rebinds and the client recovers.

Full suite: 717 passed, 3 skipped (124 files). pnpm check clean.

Relation to #191

#191 prevents redundant launch decisions from DaemonClient callers. This
prevents redundant binds from any start path, including the direct daemon start --foreground in #191's own repro. client.ts is unchanged.

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 31, 2026

Codex review: needs maintainer review before merge. Reviewed May 31, 2026, 8:58 PM ET / 00:58 UTC.

Summary
The PR adds a daemon-host bind lock, status/pid/socket liveness probe, metadata-match check, and daemon host/integration coverage to prevent direct daemon starts from rebinding over a live owner.

Reproducibility: yes. Current main source shows foreground/child starts enter runDaemonHost directly and the host unconditionally prepares and binds the socket; the PR body also provides before/after macOS process-count proof for the race.

Review metrics: 2 noteworthy metrics.

  • Diff surface: 1 daemon source file, 2 daemon test files. The runtime change is narrowly scoped to host binding behavior while validation lives in daemon-specific tests.
  • Regression cases: 5 probe cases, 4 metadata cases, 2 integration cases. The tests cover the prior liveness and metadata blockers plus the process race this PR is intended to fix.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • [P1] Merging intentionally changes explicit foreground/debug duplicate starts from taking over the socket to exiting when a healthy matching daemon already owns it; users who relied on repeated foreground starts as a takeover shortcut should use mcporter daemon restart or stop first.
  • [P1] The broader daemon lifecycle reconciliation discussed in the PR comments, including shared collectConfigLayers cleanup and reconcile-before-rebind behavior, remains separate from this narrow host bind guard.

Maintainer options:

  1. Accept singleton start semantics (recommended)
    Merge as-is if maintainers agree duplicate explicit or foreground starts should no-op against a healthy matching daemon and users should use daemon restart for intentional takeover.
  2. Add a small operator-facing note
    Before merge, ask for a narrow CLI/docs note explaining that foreground duplicate starts exit when another daemon already owns the socket.
  3. Defer for full lifecycle reconciliation
    Pause this PR only if maintainers want the config-layer unification, reconcile-when-fresh, and stop-before-rebind decisions solved together instead of landing the duplicate-bind fix now.

Next step before merge

  • No automated repair is needed; a maintainer should accept the singleton-start compatibility tradeoff and merge or request a small operator-facing note.

Security
Cleared: The diff only changes daemon host logic and daemon tests, with no dependency, workflow, credential, release, or package-resolution changes found.

Review details

Best possible solution:

Land the narrow host bind guard with its regression coverage, then track broader daemon lifecycle reconciliation separately if maintainers want the reference-branch behavior.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main source shows foreground/child starts enter runDaemonHost directly and the host unconditionally prepares and binds the socket; the PR body also provides before/after macOS process-count proof for the race.

Is this the best way to solve the issue?

Yes. The host-side bind lock plus status/pid/socket probe and metadata-match guard is the narrowest place to cover direct starts that bypass DaemonClient, with broader lifecycle cleanup left out of scope.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 2bf7a5eab23f.

Label changes

Label changes:

  • remove merge-risk: 🚨 availability: Current PR review merge-risk labels are merge-risk: 🚨 compatibility.

Label justifications:

  • P1: The PR addresses a daemon lifecycle race that can orphan daemon processes and break long-lived agent/channel workflows for real users.
  • merge-risk: 🚨 compatibility: The patch deliberately changes duplicate explicit foreground starts from rebinding/taking over to exiting when a healthy matching daemon is already running.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix live process-count results across foreground, detached, and cold-start scenarios, supplemented by a maintainer-side build and daemon-test verification comment.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix live process-count results across foreground, detached, and cold-start scenarios, supplemented by a maintainer-side build and daemon-test verification comment.
Evidence reviewed

What I checked:

  • Repository policy read: AGENTS.md was read in full; its conservative review, proof, and validation guidance was applied to this PR review. (AGENTS.md:1)
  • Foreground starts bypass the client guard on main: Current main calls runDaemonHost directly for foreground or child daemon starts, so these paths do not use DaemonClient's start lock before entering the host bind code. (src/cli/daemon-command.ts:98, 2bf7a5eab23f)
  • Current main host still binds unconditionally: Current main prepares/unlinks the socket before listening and writing metadata, with no host-side liveness or metadata-match guard around the bind. (src/daemon/host.ts:86, 2bf7a5eab23f)
  • Related client-side lifecycle work is already on main: Commit f4f2093 added the DaemonClient filesystem lock, verified status, and config freshness restart path; this PR targets the remaining host path rather than duplicating that client work. (src/daemon/client.ts:115, f4f209317ff5)
  • PR host guard implementation: The PR serializes host binding under a dedicated bind lock, probes a live daemon by status response plus live pid and matching socket, skips only when metadata matches, and otherwise rebinds and writes fresh metadata. (src/daemon/host.ts:191, 6d67074ea431)
  • PR regression coverage: The PR adds live/hung/foreign/dead/no-listener probe cases, metadata match/mismatch/missing/corrupt cases, and process-level integration tests for foreground races and missing-metadata rebinds. (tests/daemon.integration.test.ts:177, 6d67074ea431)

Likely related people:

  • Peter Steinberger: Git blame shows the current daemon host bind path dates to release commit 94e65ba, and git history shows Peter authored the recent f4f2093 daemon lifecycle reconciliation and 56be50f daemon host follow-up on main. (role: introduced behavior and recent area contributor; confidence: high; commits: 94e65ba0572e, f4f209317ff5, 56be50f76354; files: src/daemon/host.ts, src/daemon/client.ts, tests/daemon-client-lifecycle.test.ts)
  • LDMB123: The PR discussion includes a maintainer-side verification comment for this head covering build plus the daemon host and integration tests, making this person useful routing context for merge readiness. (role: recent verifier; confidence: medium; files: tests/daemon-host.test.ts, tests/daemon.integration.test.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8747e4584f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/daemon/host.ts Outdated
Comment on lines +193 to +194
if (await isSocketAccepting(options.socketPath)) {
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require status verification before abandoning the bind

When the existing daemon process is hung but its Unix socket is still in listen(2), this connect-only probe succeeds even though client.status() times out and the client is trying to recover by launching a replacement. In that restart path (sendRequest treats ETIMEDOUT as a transport error), the new child now exits here without rebinding, leaving the caller stuck talking to the unresponsive daemon until the startup wait fails. Please verify the socket answers a daemon status request, or allow rebinding after the old daemon is deemed unresponsive.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ae4bf18: the probe no longer treats a bare socket connect as proof of life. It now sends a status request and only skips rebinding when the response reports this socket path and a live pid; a hung (SIGSTOP), dead, or foreign listener falls through to prepareSocket() and rebinds, so stale-socket recovery is preserved.

Verified on macOS: a SIGSTOPped daemon (socket still listening) is reclaimed by a replacement in ~2s, while concurrent healthy foreground starts still settle to one daemon. Added probe unit tests (live / hung / foreign-socket / dead-pid / no-listener) and kept the foreground-race integration test.

I also dropped the daemon start --foreground precheck from the PR: it called client.status() with the default timeout, which stalled ~30s against a hung daemon. The host-side probe is the single guard now.

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 31, 2026
@zm2231 zm2231 force-pushed the fix/daemon-singleton-bind-guard branch from 8747e45 to ae4bf18 Compare May 31, 2026 19:12
Explicit and launchd-driven starts (daemon start, daemon start --foreground,
detached children) bypass the client's start lock and call runDaemonHost
directly. With no guard each one unlinks and rebinds the daemon socket,
orphaning the previous daemon.

Claim the socket under a dedicated bind lock (distinct from the client's
metadata lock, to avoid deadlocking a client that holds it while awaiting
readiness). Skip rebinding only when a status probe proves a live daemon owns
the socket: it must answer the daemon protocol, report this socket path, and
have a live pid. A hung, dead, or foreign listener falls through to reclaim the
socket, preserving stale-socket recovery.

Adds probe unit tests and a foreground-race integration test.
@zm2231 zm2231 force-pushed the fix/daemon-singleton-bind-guard branch from ae4bf18 to d28069e Compare May 31, 2026 19:17
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. P1 Urgent regression or broken agent/channel workflow affecting real users now. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. labels May 31, 2026
@zm2231
Copy link
Copy Markdown
Contributor Author

zm2231 commented May 31, 2026

@clawsweeper re-review

The P1 is addressed in d28069e: the host bind guard no longer treats a bare socket connect as proof of life. It now verifies a status response that reports this socket path and a live pid, and falls through to prepareSocket() for hung, dead, or foreign listeners (so stale-socket recovery is preserved). Added probe unit tests (live / hung / foreign-socket / dead-pid / no-listener) plus the foreground-race integration test.

Note: the daemon start --foreground precheck mentioned in the prior review was removed — it called client.status() with the default timeout and stalled ~30s against a hung daemon. The host-side probe is the single guard now.

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 31, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

…e daemon

The guard skipped binding whenever a live daemon answered on the socket, even
when the metadata file was missing or pointed at a different pid/socket. A client
that launched a replacement then polled readiness until timeout, because its
readiness check requires metadata to match the responder.

Skip the bind only when on-disk metadata already matches the live responder's pid
and socket (the client's readiness contract). Otherwise rebind, which replaces a
stale or foreign owner and writes correct metadata, the recovery main performs.

The bind uses a dedicated ${metadataPath}.bind lock, separate from the client's
metadata lock, so a client awaiting readiness while holding that lock cannot
deadlock the bind.
@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. labels May 31, 2026
@zm2231
Copy link
Copy Markdown
Contributor Author

zm2231 commented May 31, 2026

Pushed 6d67074 for the metadata blocker. The host guard now skips binding only when on-disk metadata already matches the live responder's pid and socket (the same contract DaemonClient.readVerifiedStatus enforces). If a live daemon owns the socket but metadata is missing or mismatched, it rebinds and writes correct metadata instead of skipping, so a client that launched a replacement can't be stranded. Corrupt or missing metadata is treated as non-matching. Added unit tests for match/mismatch/missing/corrupt plus an integration test for the missing-metadata rebind. client.ts is unchanged.

While digging into the metadata path I hit two issues that turned out to predate this PR and live in the client lifecycle code. I didn't realize that scope when I opened the PR, so I've kept this one narrow (just the host bind guard) and put the fuller exploration on a separate reference branch rather than expanding the PR:

  1. Divergent collectConfigLayers. DaemonClient has its own copy with a no-config-file fallback layer, while the shared src/daemon/config-layers.ts one (used by the host to write metadata.configLayers) has no fallback. For an import-only / no-config-file daemon the two disagree, so the client treats a freshly started daemon as stale. Unifying them is the clean fix.
  2. Reconcile vs. rebind. When a live daemon owns the socket with missing or stale metadata, this PR rebinds (matching today's behavior). An alternative is to reconcile metadata to the live daemon when its config is still fresh, and only rebind when stale, so a healthy daemon isn't replaced.

Reference branch with both (not a PR): main...zm2231:explore/daemon-lifecycle-full. It has one known follow-up of its own (stopping a stale daemon before rebind instead of leaving it orphaned). Happy to split the collectConfigLayers unification into its own small PR if that's useful.

Copy link
Copy Markdown
Contributor

LDMB123 commented May 31, 2026

Maintainer-side verification for current head 6d67074ea4319a0a576a411dbdf1b6cd0c091a64 from a fresh PR worktree:

  • pnpm build -> passed.
  • pnpm exec vitest run tests/daemon-host.test.ts tests/daemon.integration.test.ts with host Unix-socket/process access -> 17 passed.

The remaining inline thread is outdated, and the author's fix reply matches the behavior covered by the passing daemon-host tests. I attempted to resolve the thread, approve the PR, and merge it from this account, but GitHub rejected those actions (403 Resource not accessible by integration for resolve/merge, 422 for approval without explicit repository access). A maintainer with repository access can resolve the stale thread and merge this head.

@zm2231
Copy link
Copy Markdown
Contributor Author

zm2231 commented May 31, 2026

Follow-up on the underlying issue the review here surfaced.

Working through the liveness and metadata feedback on this PR led into the daemon lifecycle reconciliation that landed in #191, and exposed a pre-existing problem under it: the host and the client decide config freshness using two different collectConfigLayers implementations. The client's copy adds a no-config-file fallback layer that the shared one (which the host uses to write metadata.configLayers) does not. So for an import-only or no-config-file daemon, the host records layers the client then judges stale, and the client keeps restarting a daemon that just started. This is independent of this PR; it lives in the #191 lifecycle code.

I kept this PR to the host bind guard, the part that directly fixes the duplicate and orphaned-daemon races, and worked the underlying fix out on a separate branch so it can be judged on its own:

It builds on this PR's two commits with three more:

  1. Unify collectConfigLayers (7131ac2). Collapses the two copies into one shared helper so the host and client agree on freshness. This is the standalone fix for the issue above, and the cleanest piece to take first.
  2. Reconcile-when-fresh (5ae6487). For the "live daemon with missing or stale metadata" case flagged in the review here: instead of rebinding and replacing a healthy daemon, if the live daemon's config is still current, rewrite metadata from the daemon's own status so the client's readiness check matches. Rebind only when the config is actually stale.
  3. Stop-before-rebind (2923669). When the config is stale, stop the live daemon and wait for its socket to disappear before rebinding, so it is shut down rather than orphaned. Cleanup unlinks metadata before the socket so a vanished socket proves cleanup finished, and a ref'd timer holds the event loop through the pre-listen takeover so the process does not drain and exit before binding.

Tests (full suite green: 718 passed, 3 skipped):

  • tests/daemon-host.test.ts: isDaemonResponding covers a live daemon with matching socket and live pid (true), a hung listener that accepts but never replies (false), a status reporting a different socket (false), a dead pid (false), and nothing listening (false). metadataMatches covers matching pid and socket (true), a different pid (false), a missing file (false), and corrupt JSON (false).
  • tests/daemon.integration.test.ts (real processes, skipped on Windows): four daemon start --foreground racing settle to one survivor (fails on main); missing metadata with fresh config reconciles to the live daemon and the replacement exits; stale config stops the old daemon (asserts it exits) and the replacement takes over.

Reference only, not a PR. Happy to send the collectConfigLayers unification as its own small PR.

@zm2231
Copy link
Copy Markdown
Contributor Author

zm2231 commented Jun 1, 2026

The current label set still carries the earlier round's P1 and merge-risk tags next to the upgraded rating: 🦞 diamond lobster / status: ready for maintainer look, and the last review predates LDMB123's maintainer-side verification (build + 17 daemon tests on this head) and the follow-up above. Re-reviewing to reconcile.

@clawsweeper re-review

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented Jun 1, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot removed the merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. label Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. P1 Urgent regression or broken agent/channel workflow affecting real users now. proof: sufficient Contributor real behavior proof is sufficient. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants