fix: stop non-client daemon starts from orphaning a running daemon #195
zm2231 wants to merge 2 commits into
Codex review: needs maintainer review before merge.
Reviewed May 31, 2026, 8:58 PM ET / 00:58 UTC.
Summary
Reproducibility: yes. Current main source shows foreground/child starts enter runDaemonHost directly and the host unconditionally prepares and binds the socket; the PR body also provides before/after macOS process-count proof for the race.
Review metrics: 2 noteworthy metrics.
Merge readiness
Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
Review details
Best possible solution: Land the narrow host bind guard with its regression coverage, then track broader daemon lifecycle reconciliation separately if maintainers want the reference-branch behavior.
Do we have a high-confidence way to reproduce the issue? Yes. Current main source shows foreground/child starts enter runDaemonHost directly and the host unconditionally prepares and binds the socket; the PR body also provides before/after macOS process-count proof for the race.
Is this the best way to solve the issue? Yes. The host-side bind lock plus status/pid/socket probe and metadata-match guard is the narrowest place to cover direct starts that bypass DaemonClient, with broader lifecycle cleanup left out of scope.
AGENTS.md: found and applied where relevant.
Codex review notes: model gpt-5.5, reasoning high; reviewed against 2bf7a5eab23f.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8747e4584f
    if (await isSocketAccepting(options.socketPath)) {
      return;
Require status verification before abandoning the bind
When the existing daemon process is hung but its Unix socket is still in listen(2), this connect-only probe succeeds even though client.status() times out and the client is trying to recover by launching a replacement. In that restart path (sendRequest treats ETIMEDOUT as a transport error), the new child now exits here without rebinding, leaving the caller stuck talking to the unresponsive daemon until the startup wait fails. Please verify the socket answers a daemon status request, or allow rebinding after the old daemon is deemed unresponsive.
Fixed in ae4bf18: the probe no longer treats a bare socket connect as proof of life. It now sends a status request and only skips rebinding when the response reports this socket path and a live pid; a hung (SIGSTOP), dead, or foreign listener falls through to prepareSocket() and rebinds, so stale-socket recovery is preserved.
Verified on macOS: a SIGSTOPped daemon (socket still listening) is reclaimed by a replacement in ~2s, while concurrent healthy foreground starts still settle to one daemon. Added probe unit tests (live / hung / foreign-socket / dead-pid / no-listener) and kept the foreground-race integration test.
I also dropped the daemon start --foreground precheck from the PR: it called client.status() with the default timeout, which stalled ~30s against a hung daemon. The host-side probe is the single guard now.
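The acceptance rule described in this reply (answers a status request, reports this socket path, has a live pid) can be sketched as a small pure check. This is a minimal illustration, not the code in src/daemon/host.ts: the StatusReply shape and the names pidIsAlive and provesLiveOwner are hypothetical, and the status request itself (with its timeout) is assumed to happen elsewhere and to yield null on failure.

```typescript
// Hypothetical shape of a daemon status reply; the real protocol may differ.
interface StatusReply {
  pid: number;
  socketPath: string;
}

// Signal 0 performs error checking only: it reports whether the pid exists
// without delivering a signal to it.
function pidIsAlive(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// `reply` is null when the status request timed out or the listener did not
// speak the daemon protocol (hung, foreign, or absent listener).
function provesLiveOwner(
  reply: StatusReply | null,
  socketPath: string,
  isAlive: (pid: number) => boolean = pidIsAlive,
): boolean {
  if (reply === null) return false; // hung or foreign listener
  if (reply.socketPath !== socketPath) return false; // daemon for another socket
  return isAlive(reply.pid); // dead pid falls through to rebind
}
```

Any false result here corresponds to the "fall through to prepareSocket() and rebind" path; only a true result would justify skipping the bind.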
8747e45 to ae4bf18
Explicit and launchd-driven starts (daemon start, daemon start --foreground, detached children) bypass the client's start lock and call runDaemonHost directly. With no guard each one unlinks and rebinds the daemon socket, orphaning the previous daemon. Claim the socket under a dedicated bind lock (distinct from the client's metadata lock, to avoid deadlocking a client that holds it while awaiting readiness). Skip rebinding only when a status probe proves a live daemon owns the socket: it must answer the daemon protocol, report this socket path, and have a live pid. A hung, dead, or foreign listener falls through to reclaim the socket, preserving stale-socket recovery. Adds probe unit tests and a foreground-race integration test.
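One common way to build a dedicated bind lock like the one described above is exclusive creation of a lock file. The sketch below assumes that mechanism; the commit message does not say how the lock is actually taken, and tryAcquireBindLock is a hypothetical name. Only the ${metadataPath}.bind path comes from this PR.

```typescript
import * as fs from "node:fs";

// Try to take an exclusive lock by creating the lock file with the "wx" flag,
// which fails with EEXIST if the file already exists.
// Returns a release function, or null if another process holds the lock.
function tryAcquireBindLock(metadataPath: string): (() => void) | null {
  const lockPath = `${metadataPath}.bind`;
  let fd: number;
  try {
    fd = fs.openSync(lockPath, "wx");
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return null;
    throw err;
  }
  // Record the holder's pid so a later process could detect a stale lock.
  fs.writeSync(fd, String(process.pid));
  fs.closeSync(fd);
  return () => fs.rmSync(lockPath, { force: true });
}
```

Because the lock file is distinct from the client's metadata lock, a client that holds the metadata lock while waiting for readiness never contends with the host for the same file. Stale-lock recovery (a holder that crashed before releasing) is left out of this sketch.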
ae4bf18 to d28069e
@clawsweeper re-review The P1 is addressed in Note: the |
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
…e daemon
The guard skipped binding whenever a live daemon answered on the socket, even
when the metadata file was missing or pointed at a different pid/socket. A client
that launched a replacement then polled readiness until timeout, because its
readiness check requires metadata to match the responder.
Skip the bind only when on-disk metadata already matches the live responder's pid
and socket (the client's readiness contract). Otherwise rebind, which replaces a
stale or foreign owner and writes correct metadata; this is the recovery that main
already performs.
The bind uses a dedicated ${metadataPath}.bind lock, separate from the client's
metadata lock, so a client awaiting readiness while holding that lock cannot
deadlock the bind.
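The skip-or-rebind rule in this commit message can be sketched as two small functions. The name metadataMatches appears in this PR's tests, but its signature here, the JSON metadata format, the Responder field names, and decideBind are assumptions made for illustration.

```typescript
// Hypothetical identity of the daemon that answered the status probe.
interface Responder {
  pid: number;
  socketPath: string;
}

// `raw` is the metadata file's contents, or null if the file is missing.
// Assumes the metadata is JSON with `pid` and `socketPath` fields.
function metadataMatches(raw: string | null, responder: Responder): boolean {
  if (raw === null) return false; // missing metadata counts as non-matching
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // corrupt metadata counts as non-matching
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const meta = parsed as Partial<Responder>;
  return meta.pid === responder.pid && meta.socketPath === responder.socketPath;
}

type BindDecision = "skip" | "rebind";

// `responder` is null when no live daemon answered the status probe.
function decideBind(responder: Responder | null, raw: string | null): BindDecision {
  if (responder === null) return "rebind";
  return metadataMatches(raw, responder) ? "skip" : "rebind";
}
```

Under these assumptions, "skip" is returned only in the one state where a waiting client's readiness check would also succeed; every other state rebinds and rewrites the metadata.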
Pushed While digging into the metadata path I hit two issues that turned out to predate this PR and live in the client lifecycle code. I didn't realize that scope when I opened the PR, so I've kept this one narrow (just the host bind guard) and put the fuller exploration on a separate reference branch rather than expanding the PR:
Reference branch with both (not a PR): main...zm2231:explore/daemon-lifecycle-full. It has one known follow-up of its own (stopping a stale daemon before rebind instead of leaving it orphaned). Happy to split the |
Maintainer-side verification for current head
The remaining inline thread is outdated, and the author's fix reply matches the behavior covered by the passing daemon-host tests. I attempted to resolve the thread, approve the PR, and merge it from this account, but GitHub rejected those actions ( |
Follow-up on the underlying issue the review here surfaced. Working through the liveness and metadata feedback on this PR led into the daemon lifecycle reconciliation that landed in #191, and exposed a pre-existing problem under it: the host and the client decide config freshness using two different I kept this PR to the host bind guard, the part that directly fixes the duplicate and orphaned-daemon races, and worked the underlying fix out on a separate branch so it can be judged on its own:
It builds on this PR's two commits with three more:
Tests (full suite green: 718 passed, 3 skipped):
Reference only, not a PR. Happy to send the |
The current label set still carries the earlier round's @clawsweeper re-review |
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
Summary
Completes the fix for #191. Its client-side pieces (PID reconciliation and a spawn mutex in DaemonClient) landed, but the proposed host-side guard ("on daemon start, if a live daemon already owns the socket, do not start a second one") did not. Explicit and launchd-driven starts (mcporter daemon start, daemon start --foreground, detached children) bypass DaemonClient entirely and call runDaemonHost directly, so each one unlinks and rebinds the socket, leaving the previous daemon alive but unreachable. #191's own headline repro (racing daemon start --foreground) still reproduces on current main.
This adds the host-side guard.
What's Included
src/daemon/host.ts
- Skip rebinding only when a status probe proves a live daemon owns the socket: it must answer a status request, report this socket path, and have a live pid. A bare connect is not enough, since a hung daemon still has its socket in listen(2) and the kernel accepts the connection.
- Skip the bind only when on-disk metadata already matches the live responder's pid and socket, which is the same readiness contract DaemonClient enforces, so a skip can never strand a client. Otherwise rebind (replacing a hung, dead, foreign, or metadata-orphaned owner and writing correct metadata), which is what main does today. Corrupt or missing metadata is treated as non-matching.
- The bind uses a dedicated ${metadataPath}.bind lock, separate from the client's metadata lock, so a client awaiting readiness while holding that lock cannot deadlock the bind.
Repro and results
Minimal manual repro (any keep-alive config at ./mcporter.json):
Reproduced on a second macOS machine against current main. Counts are surviving daemon processes after the race settles.
[Results table: surviving daemon counts on main and with this change. Scenarios: daemon start --foreground x5, 5 trials; daemon start x5, 3 trials; mcporter list x10, 3 trials.]
A frozen daemon (SIGSTOP, socket still listening) is reclaimed by a replacement in ~2s, and a live daemon whose metadata is missing is also rebound rather than stranding the caller, so the guard does not break stale or hung-socket recovery.
Tests
- tests/daemon-host.test.ts: probe (live / hung / foreign-socket / dead-pid / no-listener) and metadataMatches (match / mismatch / missing / corrupt).
- tests/daemon.integration.test.ts: a 4-way daemon start --foreground race asserting one survivor (fails on main), and a missing-metadata case asserting the replacement rebinds and the client recovers.
Full suite: 717 passed, 3 skipped (124 files). pnpm check clean.
Relation to #191
#191 prevents redundant launch decisions from DaemonClient callers. This prevents redundant binds from any start path, including the direct daemon start --foreground in #191's own repro. client.ts is unchanged.