Describe the bug
When a node holds leadership briefly (tens of milliseconds) and then loses it, ProjectionManager.Stop() clears the _projections dictionary but does not cancel the in-flight $projections-$all read dispatched through _readForwardDispatcher. The stale read callback fires after Stop completes and repopulates _projections while _started is false. These zombie entries persist in memory for as long as the process runs. If the same node later becomes leader again, the new $projections-$all read finds every projection already in _projections via ContainsKey(), logs "duplicate registration event" for each, and skips them all, resulting in zero running projections.
KurrentDB details
- KurrentDB server version: 23.10.2 (oss-v23.10.2). Code review indicates the same vulnerability exists through at least v26.0.0.
- Deployment: 5-node cluster on Kubernetes, DNS-based gossip discovery
- Operating system: Debian Bookworm (container)
Observed behavior (from production logs)
All observations below are from the same EventStoreDB process (PID 370866) on a single node. Timestamps are UTC.
1. Brief leadership at 03:48:54 — projection subsystem starts and stops within 30ms
The node won an election, held leadership for ~30ms, then lost it:
2026-03-27T03:48:54.383 IS LEADER... SPARTA!
2026-03-27T03:48:54.413 IS FOLLOWER... LEADER IS [eventstore-1]
The projection subsystem started, issued a read, then stopped:
03:48:54.384 [INF] PROJECTIONS SUBSYSTEM: Starting components for Instance: b94151fb-0f0e-4d82-8b36-71186474ac1b
03:48:54.385 [DBG] PROJECTIONS: Starting Projections Manager. Correlation: b94151fb-...
03:48:54.386 [DBG] PROJECTIONS: Reading Existing Projections from "$projections-$all"
03:48:54.387 [DBG] PROJECTIONS SUBSYSTEM: Component '"ProjectionManager"' started for Instance: b94151fb-...
03:48:54.404 [INF] PROJECTIONS SUBSYSTEM: All components started for Instance: b94151fb-...
03:48:54.404 [INF] PROJECTIONS SUBSYSTEM: Node state is no longer Leader. Stopping projections. Current node state: PreReplica
03:48:54.404 [INF] PROJECTIONS SUBSYSTEM: Stopping components for Instance: b94151fb-...
03:48:54.405 [DBG] PROJECTIONS: Stopping Projections Manager. Correlation b94151fb-...
03:48:54.406 [INF] PROJECTIONS SUBSYSTEM: IO Dispatcher from "ProjectionManager" has been drained. 3 of 4 queues empty.
03:48:54.408 [INF] PROJECTIONS SUBSYSTEM: All components stopped and dispatchers drained for Instance: b94151fb-...
2. Stale callback fires 328ms after Stop — repopulates _projections
After the subsystem reported "All components stopped", the $projections-$all read callback still fired and added all 6 projections:
03:48:54.733 [DBG] PROJECTIONS: Found the following projections in "$projections-$all": ["$by_category", "$stream_by_category", "$streams", "$by_event_type", "$by_correlation_id", "searchintegration"]
03:48:54.735 [DBG] Adding projection 20c111c6-...@"$by_category" to list
03:48:54.735 [DBG] Adding projection cb6deaa0-...@"$stream_by_category" to list
03:48:54.735 [DBG] Adding projection f0715542-...@"$streams" to list
03:48:54.735 [DBG] Adding projection 910adc2d-...@"$by_event_type" to list
03:48:54.735 [DBG] Adding projection 287a22d7-...@"$by_correlation_id" to list
03:48:54.735 [DBG] Adding projection bcdeafb4-...@"searchintegration" to list
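Note the timestamps: Stop was invoked at 03:48:54.405 and the subsystem reported all components stopped at 03:48:54.408. The stale callback fired at 03:48:54.733, which is 328ms after Stop was invoked and 325ms after the subsystem was fully stopped.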
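The gap can be checked directly from the log timestamps quoted above. This is a small sketch; the helper name `ms_between` is ours and assumes both timestamps fall on the same day.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S.%f"  # matches the HH:MM:SS.mmm prefix on the log lines


def ms_between(start: str, end: str) -> int:
    """Whole milliseconds from start to end (same-day timestamps)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta // timedelta(milliseconds=1)


stop_invoked = "03:48:54.405"    # "Stopping Projections Manager"
stop_reported = "03:48:54.408"   # "All components stopped and dispatchers drained"
callback_fired = "03:48:54.733"  # "Found the following projections in ..."

print(ms_between(stop_invoked, callback_fired))   # 328
print(ms_between(stop_reported, callback_fired))  # 325
```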
3. Node stays Follower for ~21 hours — zombie entries persist
The node remained a Follower from 03:48:54 until the next election:
03:48:54.413 IS FOLLOWER... LEADER IS [eventstore-1]
03:48:54.929 IS FOLLOWER... LEADER IS [eventstore-1]
21:10:47.182 IS FOLLOWER... LEADER IS [eventstore-3]
Same PID throughout — the process was never restarted, so the _projections dictionary retained the 6 zombie entries.
4. Node becomes leader 21 hours later — projections permanently fail
When the previous leader (eventstore-3) went down, this node won the election. The projection subsystem started with a single new correlation ID, but the $projections-$all read found every projection already in _projections:
2026-03-28T00:31:38.284 [DBG] PROJECTIONS SUBSYSTEM: Not stopping because subsystem is not in a started state. Current Subsystem state: Stopped
00:31:38.314 [INF] IS LEADER... SPARTA!
00:31:38.314 [INF] PROJECTIONS SUBSYSTEM: Starting components for Instance: 299e9a28-0ff4-4ccb-92f7-62a41283e8f6
00:31:38.314 [DBG] PROJECTIONS: Starting Projections Manager. Correlation: 299e9a28-...
00:31:38.314 [DBG] PROJECTIONS: Reading Existing Projections from "$projections-$all"
00:31:38.314 [DBG] PROJECTIONS SUBSYSTEM: Component '"ProjectionManager"' started for Instance: 299e9a28-...
00:31:38.315 [INF] PROJECTIONS SUBSYSTEM: All components started for Instance: 299e9a28-...
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$by_category" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$stream_by_category" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$streams" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$by_event_type" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$by_correlation_id" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "searchintegration" has a duplicate registration event.
00:31:38.327 [DBG] PROJECTIONS: Found the following projections in "$projections-$all": []
00:31:38.329 [DBG] PROJECTIONS: Conflict. Duplicate projection names : $streams, $stream_by_category, $by_category, $by_event_type, $by_correlation_id
Only one correlation ID (299e9a28), one Start, one read. No Start/Stop cycle during this election. The "duplicate registration event" warnings come from _projections.ContainsKey() returning true — the zombie entries from 21 hours earlier.
Projections remained at zero. The node had to be killed and replaced with a fresh process to recover.
5. Fresh process on a different node loads projections cleanly
When the affected node was killed, a different node (fresh process, PID 13582) became leader and loaded all 6 projections without errors:
01:45:05.662 [INF] IS LEADER... SPARTA!
01:45:06.015 [DBG] PROJECTIONS: Found the following projections in "$projections-$all": ["$by_category", "$stream_by_category", "$streams", "$by_event_type", "$by_correlation_id", "searchintegration"]
01:45:06.018 [DBG] Adding projection 555d7743-...@"$by_category" to list
01:45:06.018 [DBG] Adding projection 7a63beaf-...@"$stream_by_category" to list
01:45:06.018 [DBG] Adding projection 4eb9c72f-...@"$streams" to list
01:45:06.018 [DBG] Adding projection 01221683-...@"$by_event_type" to list
01:45:06.018 [DBG] Adding projection fdc7b1dd-...@"$by_correlation_id" to list
01:45:06.018 [DBG] Adding projection 69f8df61-...@"searchintegration" to list
No duplicate registration warnings. A fresh process has an empty _projections dictionary, so the stale-callback problem does not apply.
Code analysis (our reading of the source — may be incomplete)
We traced the following in oss-v23.10.2. The structure appears unchanged through v26.0.0.
Why the stale callback survives Stop
ProjectionManager.Stop() clears _projections (L236) and drains _ioDispatcher (L233), but the $projections-$all read is dispatched through _readForwardDispatcher — a separate RequestResponseDispatcher. Its CancelAll() method exists but is never called from Stop().
Why the stale callback executes blindly
OnProjectionsListReadCompleted has no _instanceCorrelationId guard. Contrast this with Handle(ComponentStarted) at the Subsystem level, which discards stale messages via a correlation check. The read callback processes events and calls CreateManagedProjectionInstance, which writes to _projections regardless of subsystem state.
Why the zombie entries can never be cleaned up
After the stale callback repopulates _projections, _started is still false (set by Stop() at L231). Any subsequent Handle(StopComponents) checks if (!_started) return and bails out without calling Stop() again — so _projections.Clear() is never reached.
Why the next Start fails
When the node later becomes leader, Handle(StartComponents) proceeds (since _started is false), issues a new $projections-$all read, and the callback checks _projections.ContainsKey(projectionName) for each event. Since the zombie entries are present, every projection is treated as a duplicate and skipped (L741-743).
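The four steps above can be condensed into a toy model. This is a sketch of our reading of the control flow, not the KurrentDB source: the class `ProjectionManagerModel`, its method names, and the `pending_reads` list standing in for reads in flight on _readForwardDispatcher are all hypothetical, and the real code is C#.

```python
class ProjectionManagerModel:
    """Toy model of the state described above. Not the KurrentDB source."""

    def __init__(self):
        self.started = False      # stands in for _started
        self.projections = {}     # stands in for _projections
        self.pending_reads = []   # reads in flight on the separate dispatcher
        self.log = []

    def start(self):
        if self.started:
            return
        self.started = True
        # Issue the $projections-$all read; the response arrives later.
        self.pending_reads.append(self.on_projections_list_read_completed)

    def stop(self):
        if not self.started:
            return                # mirrors `if (!_started) return`
        self.started = False
        self.projections.clear()  # note: pending_reads is left untouched

    def deliver_pending_reads(self, names):
        callbacks, self.pending_reads = self.pending_reads, []
        for callback in callbacks:
            callback(names)

    def on_projections_list_read_completed(self, names):
        # No check of self.started or of a correlation ID before mutating state.
        found = []
        for name in names:
            if name in self.projections:
                self.log.append(f"duplicate registration event: {name}")
            else:
                self.projections[name] = "registered"
                found.append(name)
        self.log.append(f"found: {found}")


NAMES = ["$by_category", "$streams", "searchintegration"]

manager = ProjectionManagerModel()
manager.start()                       # brief leadership begins
manager.stop()                        # leadership lost, read still in flight
manager.deliver_pending_reads(NAMES)  # stale callback repopulates the dict
zombies = sorted(manager.projections)

manager.stop()                        # no-op: started is already False
manager.start()                       # leader again, much later
manager.deliver_pending_reads(NAMES)  # every name is now a "duplicate"

print(zombies)
print(manager.log[-1])
```

In this model the second `stop()` returns early, so the zombie entries survive until the next `start()`, and the final log entry is `found: []`, matching the shape of the production logs.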
Possible fixes (our suggestions — open to guidance)
- Cancel outstanding reads on Stop: call _readForwardDispatcher.CancelAll() in Stop().
- Add a correlation guard to the read callback: capture _instanceCorrelationId when issuing the read in ReadProjectionsList and check it in OnProjectionsListReadCompleted. Discard the response if the IDs don't match.
- Clear _projections at the start of Handle(StartComponents): as a belt-and-suspenders measure, clear _projections before calling StartExistingProjections() so zombie entries from a prior instance are always wiped.
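To compare the three suggestions, here is a toy model with each one behind a flag. It is a sketch under our own assumptions, not a patch: `GuardedManagerModel`, `loaded_after_flap`, and the flag names are hypothetical, and the model only mirrors the control flow described in this report.

```python
import uuid


class GuardedManagerModel:
    """Toy model with the three suggested fixes as independent switches."""

    def __init__(self, cancel_on_stop=False, correlation_guard=False,
                 clear_on_start=False):
        self.cancel_on_stop = cancel_on_stop
        self.correlation_guard = correlation_guard
        self.clear_on_start = clear_on_start
        self.started = False
        self.instance_id = None   # stands in for _instanceCorrelationId
        self.projections = {}     # stands in for _projections
        self.pending_reads = []   # instance IDs captured when reads were issued

    def start(self):
        if self.started:
            return
        if self.clear_on_start:
            self.projections.clear()        # fix 3: wipe leftovers first
        self.started = True
        self.instance_id = uuid.uuid4()
        self.pending_reads.append(self.instance_id)

    def stop(self):
        if not self.started:
            return
        self.started = False
        self.instance_id = None
        self.projections.clear()
        if self.cancel_on_stop:
            self.pending_reads.clear()      # fix 1: drop in-flight reads

    def deliver_pending_reads(self, names):
        pending, self.pending_reads = self.pending_reads, []
        loaded = []
        for issued_for in pending:
            loaded.extend(self._on_read_completed(issued_for, names))
        return loaded

    def _on_read_completed(self, issued_for, names):
        if self.correlation_guard and issued_for != self.instance_id:
            return []                       # fix 2: discard stale response
        loaded = []
        for name in names:
            if name not in self.projections:
                self.projections[name] = issued_for
                loaded.append(name)
        return loaded


NAMES = ["$by_category", "$streams", "searchintegration"]


def loaded_after_flap(**fixes):
    """Flap leadership once, deliver the stale read, then become leader again."""
    model = GuardedManagerModel(**fixes)
    model.start()
    model.stop()
    model.deliver_pending_reads(NAMES)      # stale response from the first term
    model.start()
    return model.deliver_pending_reads(NAMES)


print(loaded_after_flap())                        # no fix: nothing loads
print(loaded_after_flap(cancel_on_stop=True))
print(loaded_after_flap(correlation_guard=True))
print(loaded_after_flap(clear_on_start=True))
```

In this model any one of the three switches lets the second term load all names. Clearing on start only helps when the stale response arrives before the next start, though; a stale response arriving after it would still need the cancel or the guard, which is why we list the clear as a secondary measure.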
To reproduce
- Set up a 3+ node KurrentDB cluster.
- Trigger a rapid leadership change on a node (e.g., a network partition that lasts ~20-50ms: long enough for StartComponents to issue the $projections-$all read, short enough for the read to still be in flight when Stop() is called).
- Wait for the stale $projections-$all read callback to fire (watch for Adding projection ... to list log lines AFTER All components stopped and dispatchers drained).
- Trigger a new leadership election on the same node (without restarting the process).
- Observe: all projections report "duplicate registration event", $projections-$all reports [], zero running projections.
The critical window is the time between ReadStreamEventsForward being dispatched and the response arriving. For a small $projections-$all stream, this is only a few hundred milliseconds, but a brief leadership flap during cluster formation or gossip instability is enough to land inside it.
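To spot the stale callback in step 3 without reading logs by eye, a small scanner like the following may help. It is a sketch that assumes the message texts match the log lines quoted in this report; the function name `find_stale_registrations` is ours.

```python
def find_stale_registrations(lines):
    """Return 'Adding projection' lines logged while the subsystem is stopped."""
    stopped = False
    stale = []
    for line in lines:
        if "All components stopped and dispatchers drained" in line:
            stopped = True
        elif "PROJECTIONS SUBSYSTEM: Starting components" in line:
            stopped = False
        elif stopped and "Adding projection" in line and "to list" in line:
            stale.append(line)
    return stale


sample = [
    "03:48:54.384 [INF] PROJECTIONS SUBSYSTEM: Starting components for Instance: b94151fb-0f0e-4d82-8b36-71186474ac1b",
    "03:48:54.408 [INF] PROJECTIONS SUBSYSTEM: All components stopped and dispatchers drained for Instance: b94151fb-...",
    '03:48:54.735 [DBG] Adding projection 20c111c6-...@"$by_category" to list',
    "00:31:38.314 [INF] PROJECTIONS SUBSYSTEM: Starting components for Instance: 299e9a28-0ff4-4ccb-92f7-62a41283e8f6",
]

for line in find_stale_registrations(sample):
    print(line)
```

Any line this returns is a registration that happened between a completed Stop and the next Start, which is the condition that leaves zombie entries behind.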
Expected behavior
A stale $projections-$all read callback from a prior leadership term should not modify _projections. Projections should start correctly on every leader election regardless of prior leadership history on the same process.