
docs: design for graceful upgrade-halt contract #8

Merged
bdchatham merged 2 commits into main from design/upgrade-shutdown-contract
Apr 26, 2026


@bdchatham commented Apr 26, 2026

Summary

First-draft issue + design under docs/upgrade-shutdown-contract/ for a graceful, distinguishable shutdown signal when seid hits an upgrade-required halt condition. Today this is a Go panic() (three sites in sei-cosmos/x/upgrade/abci.go), indistinguishable to any process supervisor from a genuine crash. The proposal:

  • A small contract in sei-config: ShutdownReason enum, exit-code constants (70/71/72), a typed HaltIntent struct, and a ParseExitCode helper. ~30-50 lines plus a unit test.
  • A two-tier architecture for the live signal: seid serves a primitive /halt_intent (Tier 1, sidecar-optional); a future controller-side sidecar exposes an opinionated aggregated /status (Tier 2, not in scope).
  • An opt-in stay-alive mode where the seid process keeps its servers up after consensus halts, so a control plane can read the halt intent before the process is gone. With the mode off, seid still exits at the halt as it does today, but as a clean exit with a distinct exit code instead of a panic stack trace.
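
A minimal sketch of what the sei-config surface in the first bullet might look like. Only the names ShutdownReason, HaltIntent, ParseExitCode and the numeric codes 70/71/72 come from this PR; the individual reason values, the pairing of each code with a reason, and the HaltIntent fields are hypothetical placeholders, not the contract in DESIGN.md.

```go
package main

import "fmt"

// ShutdownReason says why seid stopped. The type name comes from the
// proposal; every value below is a hypothetical placeholder.
type ShutdownReason string

const (
	ReasonUnknown        ShutdownReason = "unknown"
	ReasonUpgradeNeeded  ShutdownReason = "upgrade_needed"  // hypothetical
	ReasonBinaryTooNew   ShutdownReason = "binary_too_new"  // hypothetical
	ReasonHandlerMissing ShutdownReason = "handler_missing" // hypothetical
)

// The values 70-72 come from the proposal. Which reason owns which code is
// an assumption made for this sketch.
const (
	ExitUpgradeNeeded  = 70
	ExitBinaryTooNew   = 71
	ExitHandlerMissing = 72
)

// HaltIntent describes an expected halt. The field set is an assumption.
type HaltIntent struct {
	Reason   ShutdownReason `json:"reason"`
	ExitCode int            `json:"exit_code"`
	Height   int64          `json:"height"`
}

// ParseExitCode maps a process exit code to a ShutdownReason. The boolean is
// false when the code is not one of the allocated upgrade-halt codes.
func ParseExitCode(code int) (ShutdownReason, bool) {
	switch code {
	case ExitUpgradeNeeded:
		return ReasonUpgradeNeeded, true
	case ExitBinaryTooNew:
		return ReasonBinaryTooNew, true
	case ExitHandlerMissing:
		return ReasonHandlerMissing, true
	default:
		return ReasonUnknown, false
	}
}

func main() {
	// 70 is an allocated halt code; 2 is what an unrecovered Go panic exits with.
	for _, code := range []int{70, 2} {
		reason, ok := ParseExitCode(code)
		fmt.Printf("exit %d -> reason=%s expected_halt=%t\n", code, reason, ok)
	}
}
```

Returning a boolean alongside the reason lets a consumer treat any unrecognized code as a genuine crash without needing a sentinel comparison.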

No code in this PR — first draft for review and iteration. The producer (sei-chain) and consumer (sei-k8s-controller) changes ship as follow-ups. This PR captures the contract shape and the coordination story.

Files

  • docs/upgrade-shutdown-contract/ISSUE.md — problem statement, customer/JTBD, scope, done criteria, placement caveat.
  • docs/upgrade-shutdown-contract/DESIGN.md — full design with goals, two-tier architecture, the endpoint-placement decision, sei-config surface, producer-side sketch, open questions, cross-repo coordination.

Test plan

  • Reviewers read both docs end-to-end
  • Land the endpoint-placement question (or surface a third option)
  • Confirm scope cut (no UpgradeInfo reader, no halt-intent.json, no termination-log JSON)
  • Confirm exit-code allocation (70-72, 73-79 reserved, 80-89 reserved for future non-upgrade graceful halts)
  • Confirm placement in sei-config is acceptable given the leaf-library charter, with the merge gate (follow-up sei-chain + sei-k8s-controller issues filed first)
  • Once aligned: I'll iterate this PR with revisions, then file follow-up issues on sei-chain and sei-k8s-controller before merge
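
To make the exit-code allocation under review concrete, here is a rough sketch of how a consumer might bucket exit codes by the ranges listed above. The ranges (70-72, 73-79, 80-89) are from this PR; the ExitClass type, its values, and ClassifyExitCode are hypothetical names used only for illustration.

```go
package main

import "fmt"

// ExitClass is a hypothetical coarse bucket for a seid exit code.
type ExitClass string

const (
	ClassCleanExit        ExitClass = "clean_exit"
	ClassUpgradeHalt      ExitClass = "upgrade_halt"      // 70-72, allocated
	ClassReserved         ExitClass = "reserved"          // 73-79, reserved
	ClassReservedGraceful ExitClass = "reserved_graceful" // 80-89, future non-upgrade graceful halts
	ClassCrashOrOther     ExitClass = "crash_or_other"
)

// ClassifyExitCode buckets an exit code using the ranges proposed in the PR.
// Anything outside those ranges (including 2, the status of an unrecovered
// Go panic) is treated as a crash or other failure.
func ClassifyExitCode(code int) ExitClass {
	switch {
	case code == 0:
		return ClassCleanExit
	case code >= 70 && code <= 72:
		return ClassUpgradeHalt
	case code >= 73 && code <= 79:
		return ClassReserved
	case code >= 80 && code <= 89:
		return ClassReservedGraceful
	default:
		return ClassCrashOrOther
	}
}

func main() {
	for _, code := range []int{0, 2, 71, 75, 85} {
		fmt.Printf("exit %d -> %s\n", code, ClassifyExitCode(code))
	}
}
```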

🤖 Generated with Claude Code

Adds a first-draft issue + design under docs/upgrade-shutdown-contract/
covering an exit-code and HaltIntent contract for seid expected halts
(operator-action-required vs genuine crash). Captures the cross-repo
coordination story (sei-config = contract, sei-chain = producer,
sei-k8s-controller = consumer) and the explicit case for a dedicated
/halt_intent route over extending /status.

No code yet — first draft for review and iteration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread docs/upgrade-shutdown-contract/DESIGN.md Outdated
Previous draft recommended a dedicated /halt_intent route on
failure-domain-isolation grounds. PR review pushed back: the "monitoring
fleet polling /status" premise doesn't hold for Sei's actual stack.

Audit confirmed: every /status consumer in the workspace is an integration
test or a thin orchestration script (sei-tendermint/rpc/test/helpers.go,
sei-cosmos/contrib/localnet_liveness.sh, networks/remote/integration.sh),
and autobahn explicitly avoids it. No production code in seid polls /status.
The workspace contains no Prometheus/Grafana/Tenderduty config. Cosmovisor
scans stderr.

What's left of the failure-isolation concern is defensive programming —
solvable with defer recover() in the /status handler's halt-intent
population path, not requiring a separate endpoint.

Updated:
- Architecture (Tier 1) describes the field, not the route
- Decision section rewritten with the audit findings and concession
- Endpoint section retitled and rewritten to describe the field
- Cross-repo coordination updated
- ISSUE.md scope line updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Collaborator Author

Merge gate satisfied. Follow-up tracking issues filed:

Both reference back to #9 (this contract issue) and the design doc in this PR.

@bdchatham bdchatham merged commit a6a33f3 into main Apr 26, 2026
2 checks passed