Stale $projections-$all read callback after Stop() repopulates _projections, causing permanent projection failure on next leader election #5574


**Describe the bug**

When a node holds leadership briefly (tens of milliseconds) and then loses it, `ProjectionManager.Stop()` clears the `_projections` dictionary but does not cancel the in-flight `$projections-$all` read dispatched through `_readForwardDispatcher`. The stale read callback fires after `Stop()` completes and repopulates `_projections` while `_started` is false. These zombie entries stay in memory for the life of the process. If the same node later becomes leader again, the new `$projections-$all` read finds every projection already in `_projections` via `ContainsKey()`, logs "duplicate registration event" for each, and skips them all, leaving zero running projections.

**KurrentDB details**

- KurrentDB server version: 23.10.2 (`oss-v23.10.2`). Our reading of the source suggests the same code path is present through at least v26.0.0.
- Deployment: 5-node cluster on Kubernetes, DNS-based gossip discovery
- Operating system: Debian Bookworm (container)

---

## Observed behavior (from production logs)

All observations below are from the same EventStoreDB process (PID `370866`) on a single node. Timestamps are UTC.

### 1. Brief leadership at 03:48:54 — projection subsystem starts and stops within 30ms

The node won an election, held leadership for ~30ms, then lost it:

```
2026-03-27T03:48:54.383  IS LEADER... SPARTA!
2026-03-27T03:48:54.413  IS FOLLOWER... LEADER IS [eventstore-1]
```

The projection subsystem started, issued a read, then stopped:

```
03:48:54.384 [INF] PROJECTIONS SUBSYSTEM: Starting components for Instance: b94151fb-0f0e-4d82-8b36-71186474ac1b
03:48:54.385 [DBG] PROJECTIONS: Starting Projections Manager. Correlation: b94151fb-...
03:48:54.386 [DBG] PROJECTIONS: Reading Existing Projections from "$projections-$all"
03:48:54.387 [DBG] PROJECTIONS SUBSYSTEM: Component '"ProjectionManager"' started for Instance: b94151fb-...
03:48:54.404 [INF] PROJECTIONS SUBSYSTEM: All components started for Instance: b94151fb-...
03:48:54.404 [INF] PROJECTIONS SUBSYSTEM: Node state is no longer Leader. Stopping projections. Current node state: PreReplica
03:48:54.404 [INF] PROJECTIONS SUBSYSTEM: Stopping components for Instance: b94151fb-...
03:48:54.405 [DBG] PROJECTIONS: Stopping Projections Manager. Correlation b94151fb-...
03:48:54.406 [INF] PROJECTIONS SUBSYSTEM: IO Dispatcher from "ProjectionManager" has been drained. 3 of 4 queues empty.
03:48:54.408 [INF] PROJECTIONS SUBSYSTEM: All components stopped and dispatchers drained for Instance: b94151fb-...
```

### 2. Stale callback fires 328ms after Stop — repopulates `_projections`

After the subsystem reported "All components stopped", the `$projections-$all` read callback still fired and added all 6 projections:

```
03:48:54.733 [DBG] PROJECTIONS: Found the following projections in "$projections-$all": ["$by_category", "$stream_by_category", "$streams", "$by_event_type", "$by_correlation_id", "searchintegration"]
03:48:54.735 [DBG] Adding projection 20c111c6-...@"$by_category" to list
03:48:54.735 [DBG] Adding projection cb6deaa0-...@"$stream_by_category" to list
03:48:54.735 [DBG] Adding projection f0715542-...@"$streams" to list
03:48:54.735 [DBG] Adding projection 910adc2d-...@"$by_event_type" to list
03:48:54.735 [DBG] Adding projection 287a22d7-...@"$by_correlation_id" to list
03:48:54.735 [DBG] Adding projection bcdeafb4-...@"searchintegration" to list
```

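Note the timestamps: the Projections Manager began stopping at `03:48:54.405`, the subsystem reported all components stopped at `03:48:54.408`, and the stale callback fired at `03:48:54.733`. That is 328ms after the manager began stopping and 325ms after the subsystem reported it was fully stopped.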
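
As a quick check of these intervals, the log timestamps can be subtracted directly. The sketch below is illustrative Python, not part of KurrentDB; the helper name `ms_between` is ours, and the four timestamps are copied from the log excerpts above.

```python
from datetime import datetime, timedelta

def ms_between(start: str, end: str) -> int:
    """Whole milliseconds between two same-day HH:MM:SS.mmm timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta // timedelta(milliseconds=1)

read_issued = "03:48:54.386"       # "Reading Existing Projections from ..."
manager_stopping = "03:48:54.405"  # "Stopping Projections Manager"
all_stopped = "03:48:54.408"       # "All components stopped and dispatchers drained"
stale_callback = "03:48:54.733"    # "Found the following projections in ..."

print(ms_between(manager_stopping, stale_callback))  # 328
print(ms_between(all_stopped, stale_callback))       # 325
print(ms_between(read_issued, stale_callback))       # 347
```

The last figure is the total time the read was in flight, which is the window during which a leadership loss can leave the callback orphaned.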

### 3. Node stays Follower for ~21 hours — zombie entries persist

The node remained a Follower from 03:48:54 until the next election:

```
03:48:54.413  IS FOLLOWER... LEADER IS [eventstore-1]
03:48:54.929  IS FOLLOWER... LEADER IS [eventstore-1]
21:10:47.182  IS FOLLOWER... LEADER IS [eventstore-3]
```

Same PID throughout — the process was never restarted, so the `_projections` dictionary retained the 6 zombie entries.

### 4. Node becomes leader 21 hours later — projections permanently fail

When the previous leader (eventstore-3) went down, this node won the election. The projection subsystem started with a single new correlation ID, but the `$projections-$all` read found every projection already in `_projections`:

```
2026-03-28T00:31:38.284 [DBG] PROJECTIONS SUBSYSTEM: Not stopping because subsystem is not in a started state. Current Subsystem state: Stopped
00:31:38.314 [INF] IS LEADER... SPARTA!
00:31:38.314 [INF] PROJECTIONS SUBSYSTEM: Starting components for Instance: 299e9a28-0ff4-4ccb-92f7-62a41283e8f6
00:31:38.314 [DBG] PROJECTIONS: Starting Projections Manager. Correlation: 299e9a28-...
00:31:38.314 [DBG] PROJECTIONS: Reading Existing Projections from "$projections-$all"
00:31:38.314 [DBG] PROJECTIONS SUBSYSTEM: Component '"ProjectionManager"' started for Instance: 299e9a28-...
00:31:38.315 [INF] PROJECTIONS SUBSYSTEM: All components started for Instance: 299e9a28-...
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$by_category" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$stream_by_category" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$streams" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$by_event_type" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "$by_correlation_id" has a duplicate registration event.
00:31:38.327 [WRN] PROJECTIONS: The following projection: "searchintegration" has a duplicate registration event.
00:31:38.327 [DBG] PROJECTIONS: Found the following projections in "$projections-$all": []
00:31:38.329 [DBG] PROJECTIONS: Conflict. Duplicate projection names : $streams, $stream_by_category, $by_category, $by_event_type, $by_correlation_id
```

Only one correlation ID (`299e9a28`), one Start, one read. No Start/Stop cycle during this election. The "duplicate registration event" warnings come from `_projections.ContainsKey()` returning true — the zombie entries from 21 hours earlier.

**Projections remained at zero. The node had to be killed and replaced with a fresh process to recover.**

### 5. Fresh process on a different node loads projections cleanly

When the affected node was killed, a different node (fresh process, PID `13582`) became leader and loaded all 6 projections without errors:

```
01:45:05.662 [INF] IS LEADER... SPARTA!
01:45:06.015 [DBG] PROJECTIONS: Found the following projections in "$projections-$all": ["$by_category", "$stream_by_category", "$streams", "$by_event_type", "$by_correlation_id", "searchintegration"]
01:45:06.018 [DBG] Adding projection 555d7743-...@"$by_category" to list
01:45:06.018 [DBG] Adding projection 7a63beaf-...@"$stream_by_category" to list
01:45:06.018 [DBG] Adding projection 4eb9c72f-...@"$streams" to list
01:45:06.018 [DBG] Adding projection 01221683-...@"$by_event_type" to list
01:45:06.018 [DBG] Adding projection fdc7b1dd-...@"$by_correlation_id" to list
01:45:06.018 [DBG] Adding projection 69f8df61-...@"searchintegration" to list
```

No duplicate registration warnings. A fresh process has an empty `_projections` dictionary, so the stale-callback problem does not apply.

---

## Code analysis (our reading of the source — may be incomplete)

We traced the following in `oss-v23.10.2`. The structure appears unchanged through `v26.0.0`.

### Why the stale callback survives Stop

[`ProjectionManager.Stop()`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L230-L239) clears `_projections` ([L236](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L236)) and drains `_ioDispatcher` ([L233](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L233)), but the `$projections-$all` read is dispatched through [`_readForwardDispatcher`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L84) — a separate `RequestResponseDispatcher`. Its [`CancelAll()`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Core/Messaging/RequestResponseDispatcher.cs#L126) method exists but is never called from `Stop()`.

### Why the stale callback executes blindly

[`OnProjectionsListReadCompleted`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L721-L777) has no `_instanceCorrelationId` guard. Contrast with [`Handle(ComponentStarted)`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/ProjectionsSubsystem.cs#L242) at the Subsystem level, which discards stale messages via a correlation ID check. The read callback processes events and calls [`CreateManagedProjectionInstance`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L1212-L1237), which writes to [`_projections`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L1237) regardless of subsystem state.

### Why the zombie entries can never be cleaned up

After the stale callback repopulates `_projections`, `_started` is still false (set by `Stop()` at [L231](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L231)). Any subsequent `Handle(StopComponents)` checks [`if (!_started) return`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L205) and bails out without calling `Stop()` again — so `_projections.Clear()` is never reached.

### Why the next Start fails

When the node later becomes leader, [`Handle(StartComponents)`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L184-L201) proceeds (since `_started` is false), issues a new `$projections-$all` read, and the callback checks [`_projections.ContainsKey(projectionName)`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L739) for each event. Since the zombie entries are present, every projection is treated as a duplicate and skipped ([L741-743](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L741-L743)).
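
To make the sequence above easier to follow, here is a toy model of the state involved. It is illustrative Python, not KurrentDB code: the class `ProjectionManagerModel`, its methods, and the queue `pending_reads` are our own stand-ins for `_started`, `_projections`, and the read in flight on `_readForwardDispatcher`, and the model assumes our reading of the source is correct.

```python
class ProjectionManagerModel:
    """Toy model of the fields discussed above; not KurrentDB code."""

    def __init__(self):
        self.started = False      # models _started
        self.projections = {}     # models _projections
        self.pending_reads = []   # models reads in flight on _readForwardDispatcher
        self.last_found = None    # names registered by the most recent read
        self.duplicates = []      # names skipped as duplicates by the most recent read

    def start(self, registered_names):
        if self.started:
            return
        self.started = True
        names = list(registered_names)
        # The read completes later; only its callback is queued here.
        self.pending_reads.append(lambda: self._on_list_read(names))

    def stop(self):
        if not self.started:
            return                    # mirrors the `if (!_started) return` check
        self.started = False
        self.projections.clear()      # pending_reads is left untouched

    def deliver_reads(self):
        """Simulate the read responses arriving."""
        pending, self.pending_reads = self.pending_reads, []
        for callback in pending:
            callback()

    def _on_list_read(self, names):
        # No check of `started` or of which leadership term issued the read.
        self.last_found, self.duplicates = [], []
        for name in names:
            if name in self.projections:
                self.duplicates.append(name)
            else:
                self.projections[name] = object()
                self.last_found.append(name)


names = ["$by_category", "$streams", "searchintegration"]
manager = ProjectionManagerModel()

manager.start(names)      # brief leadership: read issued
manager.stop()            # leadership lost while the read is in flight
manager.deliver_reads()   # stale callback repopulates projections
zombies = sorted(manager.projections)

manager.stop()            # no-op because started is already False
manager.start(names)      # leadership regained much later
manager.deliver_reads()

print(zombies)             # ['$by_category', '$streams', 'searchintegration']
print(manager.last_found)  # []
print(manager.duplicates)  # ['$by_category', '$streams', 'searchintegration']
```

The final two lines correspond to the `Found the following projections ... []` and "duplicate registration event" log lines from the second leadership term.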

### Possible fixes (our suggestions — open to guidance)

1. **Cancel outstanding reads on Stop**: call `_readForwardDispatcher.CancelAll()` in `Stop()`.
2. **Add a correlation guard to the read callback**: capture `_instanceCorrelationId` when issuing the read in [`ReadProjectionsList`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L698-L718) and check it in [`OnProjectionsListReadCompleted`](https://github.com/kurrent-io/KurrentDB/blob/oss-v23.10.2/src/EventStore.Projections.Core/Services/Management/ProjectionManager.cs#L721). Discard the response if the IDs don't match.
3. **Clear `_projections` at the start of `Handle(StartComponents)`**: as a belt-and-suspenders measure, clear `_projections` before calling `StartExistingProjections()` so zombie entries from a prior instance are always wiped.
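
The three suggestions can be compared in a small toy model. This is illustrative Python, not KurrentDB code: `ManagerWithFixes`, its flags, and `flap_then_lead` are hypothetical names of ours. The guard in this sketch also checks `started`, because the model does not assume that `Stop()` resets the instance ID.

```python
import uuid

class ManagerWithFixes:
    """Toy model with each suggested fix behind a flag; not KurrentDB code."""

    def __init__(self, cancel_on_stop=False, guard_callback=False, clear_on_start=False):
        self.cancel_on_stop = cancel_on_stop
        self.guard_callback = guard_callback
        self.clear_on_start = clear_on_start
        self.started = False
        self.instance_id = None   # models _instanceCorrelationId
        self.projections = {}     # models _projections
        self.pending_reads = []   # models reads in flight
        self.last_found = []      # names registered by the most recent read

    def start(self, names):
        if self.started:
            return
        self.started = True
        self.instance_id = uuid.uuid4()
        if self.clear_on_start:
            self.projections.clear()      # fix 3: wipe entries from a prior term
        issued_for = self.instance_id     # captured when the read is issued
        wanted = list(names)
        self.pending_reads.append(lambda: self._on_list_read(wanted, issued_for))

    def stop(self):
        if not self.started:
            return
        self.started = False
        self.projections.clear()
        if self.cancel_on_stop:
            self.pending_reads.clear()    # fix 1: cancel outstanding reads

    def deliver_reads(self):
        pending, self.pending_reads = self.pending_reads, []
        for callback in pending:
            callback()

    def _on_list_read(self, names, issued_for):
        stale = not self.started or issued_for != self.instance_id
        if self.guard_callback and stale:
            return                        # fix 2: discard a response from a prior term
        self.last_found = [n for n in names if n not in self.projections]
        for name in self.last_found:
            self.projections[name] = issued_for


def flap_then_lead(**fixes):
    """Brief leadership, stale callback, then a second leadership term."""
    names = ["$by_category", "$streams", "searchintegration"]
    manager = ManagerWithFixes(**fixes)
    manager.start(names)
    manager.stop()
    manager.deliver_reads()   # stale callback (if not cancelled)
    manager.start(names)
    manager.deliver_reads()
    return manager.last_found

print(flap_then_lead())                      # []
print(flap_then_lead(cancel_on_stop=True))   # ['$by_category', '$streams', 'searchintegration']
print(flap_then_lead(guard_callback=True))   # ['$by_category', '$streams', 'searchintegration']
print(flap_then_lead(clear_on_start=True))   # ['$by_category', '$streams', 'searchintegration']
```

In this model each fix alone recovers the second leadership term. Fix 3 only helps when the stale callback arrives before the next start, which is one reason to treat it as a backstop rather than the primary fix.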

---

## To reproduce

1. Set up a 3+ node KurrentDB cluster.
2. Trigger a rapid leadership change on a node (e.g., a network partition that lasts ~20-50ms — long enough for `StartComponents` to issue the `$projections-$all` read, short enough for the read to still be in flight when `Stop()` is called).
3. Wait for the stale `$projections-$all` read callback to fire (watch for `Adding projection ... to list` log lines AFTER `All components stopped and dispatchers drained`).
4. Trigger a new leadership election on the same node (without restarting the process).
5. Observe: all projections report "duplicate registration event", `$projections-$all` reports `[]`, zero running projections.

The critical window is the time between `ReadStreamEventsForward` being dispatched and the response arriving. In our logs the read was issued at `03:48:54.386` and its callback fired at `03:48:54.733`, roughly 350ms later, even though `$projections-$all` is a small stream. A brief leadership flap during cluster formation or gossip instability is therefore enough to land inside the window.
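
Step 3 can be checked mechanically by scanning the server log for `Adding projection ... to list` lines that appear after `All components stopped and dispatchers drained` and before the next `Starting components for Instance`. The sketch below is illustrative Python; the function name `find_stale_registrations` is ours, and it assumes the log line wording matches the excerpts in this issue.

```python
import re

STARTING = "Starting components for Instance"
STOPPED = "All components stopped and dispatchers drained"
ADDING = re.compile(r'Adding projection \S+@"([^"]+)" to list')

def find_stale_registrations(lines):
    """Return projection names added while the subsystem was reported stopped."""
    stopped = False
    stale = []
    for line in lines:
        if STARTING in line:
            stopped = False
        elif STOPPED in line:
            stopped = True
        elif stopped:
            match = ADDING.search(line)
            if match:
                stale.append(match.group(1))
    return stale

sample = [
    '03:48:54.384 [INF] PROJECTIONS SUBSYSTEM: Starting components for Instance: b94151fb-0f0e-4d82-8b36-71186474ac1b',
    '03:48:54.408 [INF] PROJECTIONS SUBSYSTEM: All components stopped and dispatchers drained for Instance: b94151fb-...',
    '03:48:54.735 [DBG] Adding projection 20c111c6-...@"$by_category" to list',
    '03:48:54.735 [DBG] Adding projection bcdeafb4-...@"searchintegration" to list',
]
print(find_stale_registrations(sample))  # ['$by_category', 'searchintegration']
```

A non-empty result on a node that is currently a Follower indicates that the node is holding zombie entries and is likely to fail on its next leadership term.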

**Expected behavior**

A stale `$projections-$all` read callback from a prior leadership term should not modify `_projections`. Projections should start correctly on every leader election regardless of prior leadership history on the same process.

