Skip to content

The Raft consensus loop took the bytes of a committed log entry (or s…#5467

Open
neeelkhadwal wants to merge 1 commit into
hyperledger:mainfrom
neeelkhadwal:main
Open

The Raft consensus loop took the bytes of a committed log entry (or s…#5467
neeelkhadwal wants to merge 1 commit into
hyperledger:mainfrom
neeelkhadwal:main

Conversation

@neeelkhadwal
Copy link
Copy Markdown

  • Bug fix

Description

The etcdraft consensus chain calls protoutil.UnmarshalBlockOrPanic(...) in three places when decoding committed Raft entries and snapshot data:

  • orderer/consensus/etcdraft/chain.go:255 — in NewChain, on snapshot data read at startup
  • orderer/consensus/etcdraft/chain.go:1192 — in apply, on each committed EntryNormal
  • orderer/consensus/etcdraft/chain.go:1253 — in apply, when preparing a snapshot trigger

If the bytes do not decode as a common.Block protobuf, UnmarshalBlockOrPanic raises a Go panic and terminates the entire orderer process — taking down every channel hosted by that orderer, not just the affected chain.

This is a critical denial-of-service vector for two reasons:

  1. Cluster-wide crash from a single bad entry. Raft replicates committed entries byte-for-byte to every follower. A byzantine proposer (or any node compromised enough to get bytes accepted as a Raft proposal — cluster TLS authenticates the sender, not the payload)
    can cause every follower in the cluster to panic in lockstep on the same apply call.
  2. Permanent crashloop on corrupted local state. A power loss or disk fault that corrupts the on-disk snapshot/WAL puts the node into an unrecoverable startup crashloop, because NewChain re-reads the same bad snapshot on every restart and panics.

The fix replaces all three call sites with protoutil.UnmarshalBlock (the error-returning variant, which the same file already uses on line 1052) and handles the error path appropriately for each context:

  • In NewChain, the wrapped error is returned to the caller so the operator sees a real startup error instead of a crashloop.
  • In apply, since the function has no return and skipping a committed entry would silently fork this node's state from the rest of the cluster, the chain is halted via the existing c.halt() graceful-shutdown path. halt is invoked on a goroutine to avoid deadlocking
    against the serve loop — the same pattern is used a few lines below for the conf-change halt path. Other channels in the same orderer process are unaffected.

CWE-248 (Uncaught Exception) / CWE-754 (Improper Check for Unusual or Exceptional Conditions).

Additional details

Diff is 24 insertions / 3 deletions across a single file; no new imports, no API changes. protoutil.UnmarshalBlock is already used elsewhere in the same file, so this aligns the three holdouts with the established pattern.

Behavior change summary:

Scenario Before After
Malformed committed Raft entry Orderer process panics; all channels go down Affected chain logs the raft index, halts via haltCallback; other channels continue
Corrupted on-disk snapshot at startup NewChain panics → systemd/k8s crashloop NewChain returns a wrapped error; operator-visible failure
Healthy operation unchanged unchanged

Testing notes for reviewers:

  • Existing etcdraft unit tests should continue to pass; they exercise the success path.
  • Suggested new tests (not yet added): (a) unit test calling apply with an EntryNormal whose Data is random non-protobuf bytes and asserting haltCallback is invoked rather than panic; (b) unit test constructing NewChain against a MemoryStorage seeded with a snapshot
    containing garbage Data and asserting an error return.

Related issues

None — surfaced during a security review of the consensus path; no public issue filed yet.

@neeelkhadwal neeelkhadwal requested a review from a team as a code owner April 27, 2026 03:42
…napshot) and called protoutil.UnmarshalBlockOrPanic(...) on them. That function does what its name says — if the bytes don't decode as a common.Block protobuf, it raises a Go panic, which terminates

   the entire orderer process.

Signed-off-by: Anil Kumar <neeel@Anils-MacBook-Air.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant