peer: rotate samizdat credentials hourly (closes engineering#3437)#472
Open
myleshorton wants to merge 4 commits into
Open
peer: rotate samizdat credentials hourly (closes engineering#3437)#472myleshorton wants to merge 4 commits into
myleshorton wants to merge 4 commits into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds periodic rotation of peer “Share My Connection” samizdat credentials by re-registering with lantern-cloud, rebuilding the libbox inbound with fresh server config, swapping the active route/box, and best-effort deregistering the prior route to reduce the impact window of leaked credentials.
Changes:
- Introduces
Config.CredRotationIntervaland acredRotationLoop()goroutine that callsrotateCreds()on a schedule (default: 1 hour). - Persists port mapping details and run context on the
Clientso rotations can re-register against the same(external IP, port)tuple and rebuild libbox. - Adds
TestClient_RotatesCredentialsAtIntervalto verify repeated register/build/close behavior over a short interval.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
peer/peer.go |
Adds credential rotation loop, stores mapping/run context for reuse, and implements re-register → rebuild → swap → deregister sequence. |
peer/peer_test.go |
Extends stub server to vary register responses per call and adds a rotation-interval test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
3b19ec0 to
8cb3713
Compare
dea5216 to
830832c
Compare
8cb3713 to
5959bbe
Compare
0e55277 to
8fe02d4
Compare
be86ee9 to
7d616d6
Compare
Currently peer.Client builds the libbox inbound exactly once per Start
and holds the same X25519 keypair / shortID / masquerade for the
entire peer process lifetime — a leaked credential (logs, telemetry,
support bundles, the route_id leakage in engineering#3440) remains
usable for hours or days.
This adds a credRotationLoop goroutine on a 1h tick. On each tick:
1. Re-register with lantern-cloud against the same (address, port)
tuple — same router-side mapping, fresh server-side row, fresh
samizdat creds.
2. Patch the new options for VPN bypass.
3. Build a new libbox service.
4. Close the old box (releases the listening port).
5. Start the new box (re-binds the same port with new creds).
6. Atomic swap of c.box, c.routeID.
7. Best-effort deregister of the prior route_id so the bandit
stops handing the old (now-invalid) creds to clients within
~immediately rather than waiting up-to-TTL for the row to
expire.
Steps 4-5 leave a brief (~hundreds of ms) window where the port is
unbound; samizdat clients see TCP RST and reconnect via the bandit.
That's the trade-off vs. the security cost of holding the same cred
for the peer process lifetime — caps blast radius from cred leakage
to ~1h regardless of how long the peer has been running.
Rotation is best-effort: a single failure logs and waits for the
next tick. The current box and creds remain serving in the failure
case so a transient register error doesn't kill the session.
Config gains CredRotationInterval (defaults to peerCredRotationInterval
= 1h) so tests drive the loop without a 1h sleep — see
TestClient_RotatesCredentialsAtInterval.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8fe02d4 to
c995f4b
Compare
Eight findings on the cred-rotation lifecycle: 1. credRotationLoop time.NewTicker panic on non-positive interval. Now clamps interval <= 0 to peerCredRotationInterval (the same default Start applies when CredRotationInterval is unset) instead of letting NewTicker panic the host process. 2. rotateCreds raced Stop. Re-check c.active under the swap lock immediately before assigning c.box / c.routeID; if Stop has cleared that state in flight, close the newly-built box and deregister the new route rather than resurrecting peer state Stop just tore down. 3. heartbeatLoop 404 raced cred rotation. heartbeatLoop captured c.routeID, sent the heartbeat, and treated a 404 as authoritative — but a rotation between those two steps deregisters the captured route_id, producing a 404 that's actually a stale response and should not trigger auto-Stop. heartbeatLoop now re-checks the current routeID under lock on 404; if it differs, the 404 is expected and the loop continues. 4. Status.ExternalIP went stale after rotation. rotation re-queries ExternalIP for Register, so update c.status.ExternalIP alongside c.status.RouteID under the swap lock. 5. Newly-registered route leaked when BuildBoxService or the subsequent Start failed. Threaded a cleanupNewRoute(reason) closure through every post-Register error path so the orphan row is deregistered with a fresh ctx (avoids cancellation-skip) rather than leaking until TTL. 6. Long availability outage when newBox.Start failed after oldBox.Close released the port. New startNewBoxWithRetry retries Start up to 5 times with exponential backoff (50ms → 800ms, total <1s) to absorb router-side TIME_WAIT / EADDRINUSE windows; preserves the previous 'fall through to next tick' behavior only after the retries are exhausted. 7. libbox.Start panic could crash the host process during background rotation, taking the user's main VPN with it (vpn/tunnel.go's tunnel-start path has a parallel recover). New runRotation wrapper defers a recover + slog.Error and lets the loop continue. 8. TestClient_RotatesCredentialsAtInterval asserted deregisterCount >= rotations-1, which would pass even if one rotation never deregistered. Tightened to >= rotations via require.Eventually so the test waits for the most recent deregister to land (rotation issues it after the swap-lock completes, so there's a small race window between observing the new register and the corresponding deregister). Tests pass under -race -count=1 across 5 consecutive runs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Seven follow-ups from the round-2 re-review:
1. startNewBoxWithRetry now wraps the libbox.Start call in a deferred
recover that converts a panic into an error return. Without this,
runRotation's recover one frame up catches the panic but only after
rotateCreds' cleanupNewRoute path has been skipped — orphaning the
freshly-registered route and leaving the port unbound.
2. Stop→Start race during rotation. The previous `!c.active` re-check
at the swap lock missed the case where Stop cleared state AND a new
Start re-set active=true before this rotation reached the swap; the
old rotation would then overwrite the new session's box / routeID.
rotateCreds now captures c.runCtx as sessionRunCtx at the top of
the function and gates both the BuildBoxService call and the final
swap on pointer-identity with the current c.runCtx — different ctx
means a Stop→Start cycle replaced the session and our newBox / new
route belong to the prior session, so close + deregister.
3. Post-swap deregister of the prior route now uses a fresh
peerCleanupTimeout-bounded Background ctx instead of the rotation
ctx. Stop cancelling that ctx between the swap and the deregister
would have left the old (now-invalid) route in the server catalog
until TTL — defeating the entire stale-credential cap the rotation
is supposed to enforce.
4. startNewBoxWithRetry budget. The previous loop slept after every
failed attempt, including the fifth (no point — we wouldn't try
again), pushing total backoff to 1550ms when the comment claimed
<1s. Now skips the final sleep, total budget 750ms (50+100+200+400
across 4 sleeps), comment updated to match.
5. peerCredRotationInterval doc dropped an engineering#3440 reference
that violated AGENTS.md:13-17 (no ticket refs in code comments).
Kept the rationale ('H2 leakage paths') without naming the issue.
6. runRotation doc dropped the explicit 'vpn/tunnel.go' reference —
the 'the main tunnel start path already wraps it with recover for
the same reason' framing carries the same context without naming
the file.
7. TestClient_RotatesCredentialsAtInterval doc dropped the
engineering#3437 ticket reference; rewords as 'pins the rotation
invariant'.
All tests pass under -race -count=1 across 5 consecutive runs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two follow-ups from the round-3 re-review: 1. Test asserted against the internal c.routeID field directly. A future change that updated the routeID used by heartbeats but forgot to mirror it into Status.RouteID would still pass the test. Use CurrentStatus() so the assertion pins the observable contract. 2. rotateCreds docstring claimed 'On failure: the prior creds and box continue serving' across all failures, but a failure after oldBox.Close (startNewBoxWithRetry exhausts retries or panics) leaves the listener down until the next rotation tick rebinds. Narrow the docstring to distinguish failures before and after oldBox.Close; the router-side port mapping survives in both cases, only the in-process listener state differs. No behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the C2 finding from the Share My Connection security review (engineering#3437): peer.Client builds the libbox inbound exactly once per Start and holds the same samizdat creds for the entire peer process lifetime, so a leaked credential remains usable for hours or days.
This adds a `credRotationLoop` goroutine on a 1h tick that re-registers, rebuilds the libbox inbound with fresh creds, and deregisters the prior route. Caps blast radius from credential leakage to ~1h regardless of how long the peer has been running.
Sequence per rotation
Steps 4-5 leave a brief (~hundreds of ms) window where the port is unbound; samizdat clients see TCP RST and reconnect via the bandit. Acceptable trade-off vs. the security cost of indefinite cred lifetime.
Failure handling
Rotation is best-effort. A single failure (transient register error, network blip) logs and waits for the next tick. The current box and creds remain serving in the failure case so a one-off transient doesn't kill the session.
Test plan
Related
Part of the Share My Connection security review series:
🤖 Generated with Claude Code