CajunSystems · contrasam · Apr 11, 2026
diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
@@ -122,3 +122,142 @@ Roux upgraded to 0.2.1, AutoCloseable fix, concurrency/resource/timeout examples
 ## ~~Milestone 4: Doc Audit & v0.7.0 Release~~ ✅ `v0.7.0`
 
 Doc audit (25 files), README rewrite, 4 legacy effect docs archived, version bumped 0.4.0→0.7.0, 629 tests green. → [Archive](.planning/milestones/v0.7.0-ROADMAP.md)
+
+---
+
+---
+
+## Milestone 5: Cluster Evaluation & Enhancement `v0.8.0`
+
+**Phases**: 22–31 | **Branch**: `feature/roux-effect-integration`
+
+Evaluate and harden the cluster module. Introduce a pluggable serialization layer to replace Java native serialization throughout messaging and persistence. Add Redis-backed shared persistence so StatefulActors survive node reassignment. Harden cluster reliability, observability, and operations.
+
+---
+
+### ~~Phase 22: Cluster & Persistence Audit~~ ✅
+
+**Goal**: Written findings document covering gaps, risks, and test coverage blind spots in both the cluster module and the persistence system.
+
+Plans:
+- 22.1 Audit `ClusterActorSystem`, `ReliableMessagingSystem`, `MessageTracker`, `EtcdMetadataStore`, `RendezvousHashing` — document design gaps and risks
+- 22.2 Audit `StatefulActor` recovery + `PersistenceProvider` impls — demonstrate the state-loss-on-reassignment bug with a failing test
+- 22.3 Audit existing cluster test coverage — identify blind spots (node failure, message ordering, split-brain)
+- 22.4 Produce findings document summarising all issues, prioritised for the phases ahead
+
+---
+
+### ~~Phase 23: Serialization Framework~~ ✅
+
+**Goal**: Pluggable `SerializationProvider` interface with Kryo and JSON implementations, wired into `ReliableMessagingSystem`, `FileMessageJournal`, and `LmdbMessageJournal`. Message types are no longer required to implement `Serializable`.
+
+Plans:
+- 23.1 Evaluate Kryo vs Jackson vs protobuf — pick Kryo as primary (performance, no-schema) and Jackson JSON as secondary (debuggability)
+- 23.2 Design and implement `SerializationProvider` interface: `byte[] serialize(Object)` / `<T> T deserialize(byte[], Class<T>)`
+- 23.3 Implement `KryoSerializationProvider` and `JsonSerializationProvider`
+- 23.4 Wire `SerializationProvider` into `ReliableMessagingSystem` (inter-node messages) and `FileMessageJournal` / `LmdbMessageJournal` (persistence)
+- 23.5 Tests: round-trip serialization, cross-version compatibility, verify `Serializable` constraint removed from message types
+
+---
+
+### ~~Phase 24: Redis Persistence Design~~ ✅
+
+**Goal**: Documented schema design for Redis-backed journal and snapshot stores, with evaluated tradeoffs vs existing file and LMDB providers.
+
+Plans:
+- 24.1 Evaluate Redis data structures for journal (Streams vs Lists vs Sorted Sets) and snapshot (Hash vs String vs JSON)
+- 24.2 Design key namespace: `cajun:journal:{actorId}` and `cajun:snapshot:{actorId}`
+- 24.3 Evaluate Redis persistence modes (RDB, AOF, no persistence) and their impact on actor durability guarantees
+- 24.4 Document tradeoffs: Redis vs LMDB vs file — latency, throughput, cross-node access, operational complexity
+- 24.5 Choose Redis client library (Lettuce vs Jedis) and add dependency to `cajun-persistence/build.gradle`
+
+---
+
+### ~~Phase 25: Redis Persistence Provider~~ ✅
+
+**Goal**: `RedisPersistenceProvider`, `RedisMessageJournal`, and `RedisSnapshotStore` implemented, tested, and registered in `PersistenceProviderRegistry`. Uses `SerializationProvider` from Phase 23.
+
+Plans:
+- 25.1 Implement `RedisMessageJournal<M>`: `append`, `readFrom`, `truncateBefore`, `getHighestSequenceNumber` using Redis Streams
+- 25.2 Implement `RedisSnapshotStore<S>`: `saveSnapshot`, `getLatestSnapshot`, `deleteSnapshots` using Redis Hash/String
+- 25.3 Implement `RedisPersistenceProvider` factory; register as `"redis"` in `PersistenceProviderRegistry`
+- 25.4 Unit tests: round-trip journal append/read, snapshot save/load, truncation, sequence number tracking
+- 25.5 Integration tests: `StatefulActor` with Redis provider — full journal replay and snapshot recovery
+
+---
+
+### ~~Phase 26: Cluster + Shared Persistence Integration~~ ✅
+
+**Goal**: `ClusterActorSystem` uses Redis-backed persistence so StatefulActors recover their full state when reassigned to a new node. `PidRehydrator` verified cross-node. Benchmark Redis vs LMDB vs file.
+
+Plans:
+- 26.1 Wire `RedisPersistenceProvider` as the default persistence provider in `ClusterActorSystem` startup
+- 26.2 Add persistence health check to cluster startup — fail fast if Redis unreachable
+- 26.3 Verify `PidRehydrator` correctly resolves `Pid` references across nodes after actor migration
+- 26.4 Integration test: actor on node A accumulates state → node A killed → actor reassigned to node B → verify full state recovered from Redis
+- 26.5 Benchmark: message throughput and recovery latency for Redis vs LMDB vs file-based persistence
+
+---
+
+### ~~Phase 27: Observability & Diagnostics~~ ✅
+
+**Goal**: Metrics API exposing throughput, latency, and cluster health; structured logging throughout cluster and persistence code.
+
+Plans:
+- 27.1 Define `ClusterMetrics` API: messages routed (local vs remote), routing latency, node count, actor count per node
+- 27.2 Define `PersistenceMetrics` API: journal append latency, snapshot save/load latency, provider health status
+- 27.3 Implement metrics collection in `ClusterActorSystem` and `ReliableMessagingSystem`
+- 27.4 Add cluster health check: `ClusterActorSystem.healthCheck()` returning node liveness, leader status, persistence reachability
+- 27.5 Improve structured logging (MDC or structured log fields) throughout cluster and persistence paths
+
+---
+
+### ~~Phase 28: Reliability Hardening~~ ✅
+
+**Goal**: Circuit breaker for node-to-node calls; exponential backoff for metadata store operations; graceful degradation when etcd is unavailable.
+
+Plans:
+- 28.1 Implement circuit breaker for `ReliableMessagingSystem` — open on repeated node failures, half-open probe, close on recovery
+- 28.2 Add exponential backoff with jitter to `EtcdMetadataStore` retry logic for transient failures
+- 28.3 Graceful degradation: define behaviour when etcd is unreachable (continue routing with cached assignments vs fail-fast)
+- 28.4 Improve error messages throughout cluster code — include node IDs, actor IDs, and failure context
+- 28.5 Tests: circuit breaker state transitions, retry backoff behaviour, degraded-mode routing
+
+---
+
+### Phase 29: Performance Optimization
+
+**Goal**: Local actor-location cache reduces etcd round-trips; EtcdMetadataStore uses connection pooling; batch actor registration on startup.
+
+Plans:
+- 29.1 Implement local actor-location cache in `ClusterActorSystem` with TTL-based invalidation and watcher-driven eviction
+- 29.2 Add connection pooling to `EtcdMetadataStore` (configure pool size, idle timeout)
+- 29.3 Batch actor registration on node startup — single bulk write instead of one etcd put per actor
+- 29.4 Profile hot paths under load — identify any remaining bottlenecks in routing or messaging
+- 29.5 Benchmarks: routing throughput and latency with and without caching; registration time with batch vs sequential
+
+---
+
+### Phase 30: Cluster Management API
+
+**Goal**: Programmatic API for cluster operations — node listing, actor migration, node draining. Config builder for simpler cluster setup.
+
+Plans:
+- 30.1 Implement `ClusterManagementApi`: `listNodes()`, `listActors(nodeId)`, `migrateActor(actorId, targetNodeId)`, `drainNode(nodeId)`
+- 30.2 Implement `ClusterConfiguration` builder — fluent API replacing raw constructor args for `ClusterActorSystem`
+- 30.3 Node drain: stop accepting new actors, migrate existing actors gracefully, signal completion
+- 30.4 Actor migration: move actor assignment in metadata store, trigger recovery on target node, verify state continuity
+- 30.5 Tests: drain-and-rejoin cycle, forced migration with state verification
+
+---
+
+### Phase 31: Testing, Documentation & Examples
+
+**Goal**: Chaos tests, cross-node state recovery tests, serialization migration guide, and production deployment guide.
+
+Plans:
+- 31.1 Chaos tests: kill random nodes under load, verify actors recover state from Redis, verify message delivery guarantees hold
+- 31.2 Cross-node state recovery test: stateful actor accumulates state across 100 messages → node killed → recovered on new node → state verified
+- 31.3 Serialization migration guide: how to move from Java native serialization to Kryo/JSON for existing actors
+- 31.4 Production deployment guide: etcd setup, Redis setup, cluster configuration, persistence configuration, observability
+- 31.5 Runnable cluster example: two-node setup with StatefulActor, Redis persistence, node failure and recovery demonstrated
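
Review note on plan 23.2 in the roadmap above: the `SerializationProvider` contract is given there only as two signatures, `byte[] serialize(Object)` and `<T> T deserialize(byte[], Class<T>)`. The sketch below shows that contract with one possible implementation, under stated assumptions. `JdkSerializationSketch` is a hypothetical name, it is backed by JDK serialization purely to stay dependency-free (so, unlike the Kryo and JSON providers the roadmap describes, it still needs `Serializable` payloads), and the exception behaviour is a guess rather than what Cajun does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// The two method signatures come from plan 23.2; everything else here is assumed.
interface SerializationProvider {
    byte[] serialize(Object value);

    <T> T deserialize(byte[] bytes, Class<T> type);
}

// Hypothetical stand-in provider backed by JDK serialization.
// Not the project's Kryo or JSON provider; payloads must be Serializable.
final class JdkSerializationSketch implements SerializationProvider {

    @Override
    public byte[] serialize(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream flushes it into the buffer.
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            // Assumption: failures surface as unchecked exceptions.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    @Override
    public <T> T deserialize(byte[] bytes, Class<T> type) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            // Class.cast rejects a payload of the wrong type with ClassCastException.
            return type.cast(in.readObject());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Payload refers to a class that is not on the classpath", e);
        }
    }
}
```

A journal or messaging layer written against the interface would call `provider.serialize(message)` on write and `provider.deserialize(bytes, MessageType.class)` on read, which is what lets the implementation be swapped without touching those callers.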
diff --git a/.planning/STATE.md b/.planning/STATE.md
@@ -1,13 +1,28 @@
 # Project State
 
 ## Current Status
-**Milestone**: 4 — Doc Audit & v0.7.0 Release
-**Phase**: 21 ✅ Complete — Milestone 4 archived as v0.7.0
-**Status**: Milestone complete — 629 tests, 0 failures, all audits clean
+**Milestone**: 5 — Cluster Evaluation & Enhancement
+**Phase**: 29 — Planned
+**Status**: Phase 29 plan written (29-1-PLAN.md) — ready to execute
 **Branch**: feature/roux-effect-integration
-**Last Updated**: 2026-04-01
+**Last Updated**: 2026-04-11
 
-## Milestone 4 Phase Progress
+## Milestone 5 Phase Progress
+
+| Phase | Name | Status |
+|-------|------|--------|
+| 22 | Cluster & Persistence Audit | ✅ Complete |
+| 23 | Serialization Framework | ✅ Complete |
+| 24 | Redis Persistence Design | ✅ Complete |
+| 25 | Redis Persistence Provider | ✅ Complete |
+| 26 | Cluster + Shared Persistence Integration | ✅ Complete |
+| 27 | Observability & Diagnostics | ✅ Complete |
+| 28 | Reliability Hardening | ✅ Complete |
+| 29 | Performance Optimization | 📋 Planned |
+| 30 | Cluster Management API | 🔲 Not started |
+| 31 | Testing, Documentation & Examples | 🔲 Not started |
+
+## Milestone 4 Phase Progress (archived — v0.7.0)
 
 | Phase | Name | Status |
 |-------|------|--------|
@@ -102,6 +117,34 @@
 - `Effect.generate()` requires `.widen()` on the handler — pass `handler.widen()`, not the raw handler
 - `CapabilityHandler.compose(h1, h2, h3)` accepts raw unwidened handlers; returns `CapabilityHandler<Capability<?>>`
 
+## Decisions Made (Milestone 5 — Phase 28)
+- `NodeCircuitBreaker` per-node (not per-actor): one node failure blocks all messages to that node; defaults `failureThreshold=5`, `resetTimeoutMs=30000` (30 s)
+- Circuit breaker implemented with `synchronized` + `volatile` — simpler than lock-free for low-contention send path
+- `ExponentialBackoff` wraps only idempotent etcd ops (`put/get/delete/listKeys`); `acquireLock` excluded (double-acquire risk); watch/connect/close excluded
+- Graceful degradation via `exceptionally()` on `metadataStore.get()` future — zero overhead on happy path; WARN on cache hit, ERROR on cache miss (message dropped)
+- `DegradedRoutingTest` key prefix: `ClusterActorSystem.ACTOR_ASSIGNMENT_PREFIX` is `"cajun/actor/"` — test corrected to match
+
+## Decisions Made (Milestone 5 — Phase 27)
+- `ClusterMetrics` and `PersistenceMetrics` placed in `cajun-core/src/main/java/com/cajunsystems/metrics/` — `ReliableMessagingSystem` is in `cajun-core` so metrics must be co-located
+- `ClusterMetrics` injected into `ReliableMessagingSystem` via optional setter `setClusterMetrics()` with null guards — two copies of `ReliableMessagingSystem` exist (cajun-core + lib), both updated
+- `ClusterHealthStatus` record: `healthy = persistenceHealthy && messagingSystemRunning`; `persistenceHealthy=true` when no provider configured (backward compat)
+- MDC cleared via try-finally in `doSendMessage()` and `handleClient()` — prevents leakage on exception
+- `logback.xml` pattern: `[%X{actorId}][%X{messageId}]` added — empty strings for non-cluster log lines
+
+## Decisions Made (Milestone 5 — Phase 26)
+- `ClusterActorSystem.withPersistenceProvider(PersistenceProvider)` fluent setter; `setupPersistence()` called in `start()` before heartbeat/leader election — no-op if null
+- Persistence health check at startup: WARN log (not fail-fast) to preserve backward compat
+- `StatefulActorClusterStateTest`: @Disabled removed, `@Tag("requires-redis")` added, shared `RedisPersistenceProvider` used for both nodes — test now asserts count=6
+- Original bug-doc test kept `@Disabled` at method level as historical documentation
+- `PersistenceBenchmarkTest`: `@Tag("performance")` only; Redis tests also `@Tag("requires-redis")`; N=500 messages; no latency SLA assertions
+
+## Decisions Made (Milestone 5 — Phase 25)
+- Redis journal key: `{prefix}:journal:{actorId}` (the literal `{}` around actorId act as a Redis Cluster hash tag, so an actor's keys land in the same hash slot); seq counter: `{prefix}:journal:{actorId}:seq`
+- Redis snapshot key: `{prefix}:snapshot:{actorId}` — single key per actor, overwrite semantics
+- `RedisPersistenceProvider` defaults to `JavaSerializationProvider`; integration tests use `KryoSerializationProvider` explicitly
+- Mocking Lettuce `RedisFuture` in tests: use concrete anonymous `RedisFuture` implementation wrapping `CompletableFuture` — avoids Mockito strict-stubbing issues with default `CompletionStage` interface methods
+- Integration tests tagged `@Tag("requires-redis")` — excluded from default Gradle test task in both `cajun-persistence` and `lib`
+
 ## Decisions Made (Milestone 2)
 - Audience: Cajun library users (self-contained, easy to run)
 - Stateful approach: ask-pattern composition, not AtomicReference-inside-effect
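
Review note on the Phase 28 decisions recorded in this diff: they describe a per-node circuit breaker that opens after repeated failures, admits a half-open probe once the reset timeout has elapsed, and closes on recovery, guarded by `synchronized` plus `volatile`. The following is a minimal sketch of that state machine, not the project's `NodeCircuitBreaker`. The class name, method names, and injectable clock are hypothetical; only the defaults (5 failures, 30 s reset) are taken from the notes.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of a per-node circuit breaker; names are illustrative only.
final class NodeCircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long resetTimeoutMs;
    private final LongSupplier clockMs; // injectable so tests can control time

    // volatile: state() can be read without taking the lock.
    private volatile State state = State.CLOSED;
    private int consecutiveFailures; // guarded by "this"
    private long openedAtMs;         // guarded by "this"

    NodeCircuitBreakerSketch(int failureThreshold, long resetTimeoutMs, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutMs = resetTimeoutMs;
        this.clockMs = clockMs;
    }

    // Defaults quoted in the Phase 28 notes: 5 failures, 30 s reset timeout.
    static NodeCircuitBreakerSketch withDefaults() {
        return new NodeCircuitBreakerSketch(5, 30_000L, System::currentTimeMillis);
    }

    State state() {
        return state;
    }

    // Returns true when a send to the node may proceed.
    synchronized boolean allowRequest() {
        if (state == State.CLOSED) {
            return true;
        }
        if (state == State.OPEN && clockMs.getAsLong() - openedAtMs >= resetTimeoutMs) {
            state = State.HALF_OPEN; // admit exactly one probe
            return true;
        }
        return false; // still open, or a probe is already in flight
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure() {
        if (state == State.OPEN) {
            return; // already tripped; keep the original open timestamp
        }
        // A failed probe re-opens immediately; otherwise count up to the threshold.
        if (state == State.HALF_OPEN || ++consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMs = clockMs.getAsLong();
        }
    }
}
```

A sender would check `allowRequest()` before each remote send and report the outcome with `recordSuccess()` or `recordFailure()`. Keeping one breaker instance per target node matches the per-node (not per-actor) choice in the notes.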