Skip to content

peer: rotate samizdat credentials hourly (closes engineering#3437)#472

Open
myleshorton wants to merge 4 commits into
fisk/peer-localbackendfrom
fisk/peer-cred-rotation
Open

peer: rotate samizdat credentials hourly (closes engineering#3437)#472
myleshorton wants to merge 4 commits into
fisk/peer-localbackendfrom
fisk/peer-cred-rotation

Conversation

@myleshorton
Copy link
Copy Markdown
Contributor

Summary

Closes the C2 finding from the Share My Connection security review (engineering#3437): peer.Client builds the libbox inbound exactly once per Start and holds the same samizdat creds for the entire peer process lifetime, so a leaked credential remains usable for hours or days.

This adds a `credRotationLoop` goroutine on a 1h tick that re-registers, rebuilds the libbox inbound with fresh creds, and deregisters the prior route. Caps blast radius from credential leakage to ~1h regardless of how long the peer has been running.

Sequence per rotation

  1. Re-register with lantern-cloud against the same (address, port) tuple — same router-side mapping, fresh server-side row, fresh samizdat creds
  2. Patch the new options for VPN bypass
  3. Build a new libbox service
  4. Close the old box (releases the listening port)
  5. Start the new box (re-binds the same port with new creds)
  6. Atomic swap of `c.box`, `c.routeID`
  7. Best-effort deregister of the prior route_id so the bandit stops handing the old (now-invalid) creds to clients within ~immediately rather than waiting up-to-TTL for the row to expire

Steps 4-5 leave a brief (~hundreds of ms) window where the port is unbound; samizdat clients see TCP RST and reconnect via the bandit. Acceptable trade-off vs. the security cost of indefinite cred lifetime.

Failure handling

Rotation is best-effort. A single failure (transient register error, network blip) logs and waits for the next tick. The current box and creds remain serving in the failure case so a one-off transient doesn't kill the session.

Test plan

  • `TestClient_RotatesCredentialsAtInterval`: drives a 50ms rotation interval and asserts ≥3 registers within 1s, deregister count tracks rotations, route_id advances past the initial value, multiple distinct boxes built, first box closed
  • All existing peer tests pass under `-race`
  • CI

Related

Part of the Share My Connection security review series:

  • ✅ engineering#3436 / lantern-cloud#2705 — egress filter (C1)
  • ✅ engineering#3437 (this PR) — cred rotation (C2)
  • 🔜 engineering#3438 — abuse-feedback path (C3)
  • 🔜 H1-H4 + M3 — follow-ups

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 8, 2026 02:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds periodic rotation of peer “Share My Connection” samizdat credentials by re-registering with lantern-cloud, rebuilding the libbox inbound with fresh server config, swapping the active route/box, and best-effort deregistering the prior route to reduce the impact window of leaked credentials.

Changes:

  • Introduces Config.CredRotationInterval and a credRotationLoop() goroutine that calls rotateCreds() on a schedule (default: 1 hour).
  • Persists port mapping details and run context on the Client so rotations can re-register against the same (external IP, port) tuple and rebuild libbox.
  • Adds TestClient_RotatesCredentialsAtInterval to verify repeated register/build/close behavior over a short interval.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File Description
peer/peer.go Adds credential rotation loop, stores mapping/run context for reuse, and implements re-register → rebuild → swap → deregister sequence.
peer/peer_test.go Extends stub server to vary register responses per call and adds a rotation-interval test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread peer/peer.go
Comment thread peer/peer.go
Comment thread peer/peer.go
Comment thread peer/peer.go
Comment thread peer/peer.go
Comment thread peer/peer.go Outdated
Comment thread peer/peer.go
Comment thread peer/peer_test.go Outdated
@myleshorton myleshorton force-pushed the fisk/peer-localbackend branch from 3b19ec0 to 8cb3713 Compare May 28, 2026 23:57
@myleshorton myleshorton force-pushed the fisk/peer-cred-rotation branch from dea5216 to 830832c Compare May 29, 2026 01:58
@myleshorton myleshorton force-pushed the fisk/peer-localbackend branch from 8cb3713 to 5959bbe Compare May 29, 2026 19:51
@myleshorton myleshorton force-pushed the fisk/peer-cred-rotation branch 3 times, most recently from 0e55277 to 8fe02d4 Compare May 29, 2026 20:17
@myleshorton myleshorton force-pushed the fisk/peer-localbackend branch from be86ee9 to 7d616d6 Compare May 29, 2026 20:43
Currently peer.Client builds the libbox inbound exactly once per Start
and holds the same X25519 keypair / shortID / masquerade for the
entire peer process lifetime — a leaked credential (logs, telemetry,
support bundles, the route_id leakage in engineering#3440) remains
usable for hours or days.

This adds a credRotationLoop goroutine on a 1h tick. On each tick:

  1. Re-register with lantern-cloud against the same (address, port)
     tuple — same router-side mapping, fresh server-side row, fresh
     samizdat creds.
  2. Patch the new options for VPN bypass.
  3. Build a new libbox service.
  4. Close the old box (releases the listening port).
  5. Start the new box (re-binds the same port with new creds).
  6. Atomic swap of c.box, c.routeID.
  7. Best-effort deregister of the prior route_id so the bandit
     stops handing the old (now-invalid) creds to clients within
     ~immediately rather than waiting up-to-TTL for the row to
     expire.

Steps 4-5 leave a brief (~hundreds of ms) window where the port is
unbound; samizdat clients see TCP RST and reconnect via the bandit.
That's the trade-off vs. the security cost of holding the same cred
for the peer process lifetime — caps blast radius from cred leakage
to ~1h regardless of how long the peer has been running.

Rotation is best-effort: a single failure logs and waits for the
next tick. The current box and creds remain serving in the failure
case so a transient register error doesn't kill the session.

Config gains CredRotationInterval (defaults to peerCredRotationInterval
= 1h) so tests drive the loop without a 1h sleep — see
TestClient_RotatesCredentialsAtInterval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@myleshorton myleshorton force-pushed the fisk/peer-cred-rotation branch from 8fe02d4 to c995f4b Compare May 29, 2026 20:44
Eight findings on the cred-rotation lifecycle:

1. credRotationLoop time.NewTicker panic on non-positive interval. Now
   clamps interval <= 0 to peerCredRotationInterval (the same default
   Start applies when CredRotationInterval is unset) instead of letting
   NewTicker panic the host process.

2. rotateCreds raced Stop. Re-check c.active under the swap lock
   immediately before assigning c.box / c.routeID; if Stop has cleared
   that state in flight, close the newly-built box and deregister the
   new route rather than resurrecting peer state Stop just tore down.

3. heartbeatLoop 404 raced cred rotation. heartbeatLoop captured
   c.routeID, sent the heartbeat, and treated a 404 as authoritative —
   but a rotation between those two steps deregisters the captured
   route_id, producing a 404 that's actually a stale response and
   should not trigger auto-Stop. heartbeatLoop now re-checks the
   current routeID under lock on 404; if it differs, the 404 is
   expected and the loop continues.

4. Status.ExternalIP went stale after rotation. rotation re-queries
   ExternalIP for Register, so update c.status.ExternalIP alongside
   c.status.RouteID under the swap lock.

5. Newly-registered route leaked when BuildBoxService or the
   subsequent Start failed. Threaded a cleanupNewRoute(reason) closure
   through every post-Register error path so the orphan row is
   deregistered with a fresh ctx (avoids cancellation-skip) rather
   than leaking until TTL.

6. Long availability outage when newBox.Start failed after oldBox.Close
   released the port. New startNewBoxWithRetry retries Start up to 5
   times with exponential backoff (50ms → 800ms, total <1s) to absorb
   router-side TIME_WAIT / EADDRINUSE windows; preserves the previous
   'fall through to next tick' behavior only after the retries are
   exhausted.

7. libbox.Start panic could crash the host process during background
   rotation, taking the user's main VPN with it (vpn/tunnel.go's
   tunnel-start path has a parallel recover). New runRotation wrapper
   defers a recover + slog.Error and lets the loop continue.

8. TestClient_RotatesCredentialsAtInterval asserted
   deregisterCount >= rotations-1, which would pass even if one
   rotation never deregistered. Tightened to >= rotations via
   require.Eventually so the test waits for the most recent deregister
   to land (rotation issues it after the swap-lock completes, so
   there's a small race window between observing the new register and
   the corresponding deregister).

Tests pass under -race -count=1 across 5 consecutive runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

Comment thread peer/peer.go Outdated
Comment thread peer/peer.go Outdated
Comment thread peer/peer_test.go Outdated
Comment thread peer/peer.go Outdated
Comment thread peer/peer.go
Comment thread peer/peer.go Outdated
Comment thread peer/peer.go Outdated
Seven follow-ups from the round-2 re-review:

1. startNewBoxWithRetry now wraps the libbox.Start call in a deferred
   recover that converts a panic into an error return. Without this,
   runRotation's recover one frame up catches the panic but only after
   rotateCreds' cleanupNewRoute path has been skipped — orphaning the
   freshly-registered route and leaving the port unbound.

2. Stop→Start race during rotation. The previous `!c.active` re-check
   at the swap lock missed the case where Stop cleared state AND a new
   Start re-set active=true before this rotation reached the swap; the
   old rotation would then overwrite the new session's box / routeID.
   rotateCreds now captures c.runCtx as sessionRunCtx at the top of
   the function and gates both the BuildBoxService call and the final
   swap on pointer-identity with the current c.runCtx — different ctx
   means a Stop→Start cycle replaced the session and our newBox / new
   route belong to the prior session, so close + deregister.

3. Post-swap deregister of the prior route now uses a fresh
   peerCleanupTimeout-bounded Background ctx instead of the rotation
   ctx. Stop cancelling that ctx between the swap and the deregister
   would have left the old (now-invalid) route in the server catalog
   until TTL — defeating the entire stale-credential cap the rotation
   is supposed to enforce.

4. startNewBoxWithRetry budget. The previous loop slept after every
   failed attempt, including the fifth (no point — we wouldn't try
   again), pushing total backoff to 1550ms when the comment claimed
   <1s. Now skips the final sleep, total budget 750ms (50+100+200+400
   across 4 sleeps), comment updated to match.

5. peerCredRotationInterval doc dropped an engineering#3440 reference
   that violated AGENTS.md:13-17 (no ticket refs in code comments).
   Kept the rationale ('H2 leakage paths') without naming the issue.

6. runRotation doc dropped the explicit 'vpn/tunnel.go' reference —
   the 'the main tunnel start path already wraps it with recover for
   the same reason' framing carries the same context without naming
   the file.

7. TestClient_RotatesCredentialsAtInterval doc dropped the
   engineering#3437 ticket reference; rewords as 'pins the rotation
   invariant'.

All tests pass under -race -count=1 across 5 consecutive runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread peer/peer_test.go Outdated
Comment thread peer/peer.go Outdated
Two follow-ups from the round-3 re-review:

1. Test asserted against the internal c.routeID field directly. A
   future change that updated the routeID used by heartbeats but
   forgot to mirror it into Status.RouteID would still pass the
   test. Use CurrentStatus() so the assertion pins the observable
   contract.

2. rotateCreds docstring claimed 'On failure: the prior creds and box
   continue serving' across all failures, but a failure after
   oldBox.Close (startNewBoxWithRetry exhausts retries or panics)
   leaves the listener down until the next rotation tick rebinds.
   Narrow the docstring to distinguish failures before and after
   oldBox.Close; the router-side port mapping survives in both cases,
   only the in-process listener state differs.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants