Open
Changes from 46 commits
77 commits
9c9f1bf
plan: define Milestone 5 — Cluster Evaluation & Enhancement (v0.8.0)
contrasam Apr 11, 2026
ac2ceab
test(22-1): add disabled failing test for state-loss-on-reassignment bug
contrasam Apr 11, 2026
6219f76
docs(22-1): audit findings for cluster and persistence modules
contrasam Apr 11, 2026
146c65b
docs(22-1): complete cluster & persistence audit plan
contrasam Apr 11, 2026
6b48f91
feat(23-1): add SerializationProvider interface with Kryo and JSON im…
contrasam Apr 11, 2026
83a72dc
feat(23-1): wire SerializationProvider into ReliableMessagingSystem
contrasam Apr 11, 2026
7d542c5
feat(23-1): wire SerializationProvider into FileMessageJournal and Fi…
contrasam Apr 11, 2026
d5f8cde
feat(23-1): wire SerializationProvider into LMDB journals and snapsho…
contrasam Apr 11, 2026
825acff
test(23-1): SerializationProvider round-trip tests for Kryo, JSON, an…
contrasam Apr 11, 2026
ea565a5
docs(23-1): complete serialization framework phase
contrasam Apr 11, 2026
d10f265
docs(24-1): Redis persistence schema design document
contrasam Apr 11, 2026
18b0969
chore(24-1): add Lettuce 6.3.2.RELEASE dependency to cajun-persistenc…
contrasam Apr 11, 2026
33b514d
docs(24-1): complete Redis persistence design plan
contrasam Apr 11, 2026
e3e0a1a
feat(25-1): implement RedisMessageJournal with Lua atomic append
contrasam Apr 11, 2026
6f4876e
feat(25-1): implement RedisSnapshotStore with single-key overwrite se…
contrasam Apr 11, 2026
a8a41c6
feat(25-1): implement RedisPersistenceProvider; exclude requires-redi…
contrasam Apr 11, 2026
7b8048b
test(25-1): unit tests for Redis journal and snapshot with mocked Let…
contrasam Apr 11, 2026
1d0d569
test(25-1): integration tests for Redis journal, snapshot, and Statef…
contrasam Apr 11, 2026
825776a
docs(25-1): complete Redis Persistence Provider plan
contrasam Apr 11, 2026
80b6a9c
feat(26-1): add withPersistenceProvider() to ClusterActorSystem with …
contrasam Apr 11, 2026
91748dc
test(26-1): unit tests for PidRehydrator in cluster/recovery context
contrasam Apr 11, 2026
45dbd6c
fix(26-1): fix StatefulActorClusterStateTest — state recovered via Re…
contrasam Apr 11, 2026
c5616cb
test(26-1): persistence throughput benchmark — file vs Redis journal …
contrasam Apr 11, 2026
1a12433
docs(26-1): complete Cluster + Shared Persistence Integration plan
contrasam Apr 11, 2026
fdc2c9b
plan(27-1): define Phase 27 — Observability & Diagnostics
contrasam Apr 11, 2026
583ebb7
feat(27-1): add ClusterMetrics — routing and node counters
contrasam Apr 11, 2026
a686231
feat(27-1): add PersistenceMetrics and PersistenceMetricsRegistry
contrasam Apr 11, 2026
4188e3f
feat(27-1): wire ClusterMetrics into ClusterActorSystem and ReliableM…
contrasam Apr 11, 2026
50b901f
feat(27-1): add ClusterHealthStatus record and ClusterActorSystem.hea…
contrasam Apr 11, 2026
22f71cd
feat(27-1): add MDC structured logging to ReliableMessagingSystem and…
contrasam Apr 11, 2026
d552fb2
test(27-1): unit tests for ClusterMetrics, PersistenceMetrics, and Cl…
contrasam Apr 11, 2026
cfdc17b
docs(27-1): complete Observability & Diagnostics plan
contrasam Apr 11, 2026
d7ab8e5
plan(28-1): define Phase 28 — Reliability Hardening
contrasam Apr 11, 2026
98a57c6
feat(28-1): add NodeCircuitBreaker and CircuitBreakerOpenException
contrasam Apr 11, 2026
5c53d33
feat(28-1): wire per-node circuit breaker into ReliableMessagingSystem
contrasam Apr 11, 2026
b46d60b
feat(28-1): add ExponentialBackoff and wire retry into EtcdMetadataStore
contrasam Apr 11, 2026
8d713c4
feat(28-1): add actor-assignment cache for graceful etcd degradation
contrasam Apr 11, 2026
70f3df5
refactor(28-1): improve error messages in cluster code with full acto…
contrasam Apr 11, 2026
0087643
test(28-1): unit tests for circuit breaker, exponential backoff, and …
contrasam Apr 11, 2026
ea65585
docs(28-1): complete Reliability Hardening plan
contrasam Apr 11, 2026
dceb35e
plan(29-1): define Phase 29 — Performance Optimization
contrasam Apr 11, 2026
e949a91
feat(29-1): add TtlCache — TTL-aware concurrent cache
contrasam Apr 11, 2026
17f8f08
feat(29-1): upgrade actorAssignmentCache to TtlCache with primary cac…
contrasam Apr 11, 2026
7270780
feat(29-1): configure gRPC keep-alive settings in EtcdMetadataStore
contrasam Apr 11, 2026
18555b9
feat(29-1): add batchRegisterActors for parallel metadata store regis…
contrasam Apr 11, 2026
28c5802
feat: merge cluster improvements from feature/roux-effect-integration
contrasam Apr 11, 2026
0ff8ec5
test(29-1): TtlCache unit tests and cluster routing benchmark
contrasam Apr 11, 2026
4b46d85
docs(29-1): complete Performance Optimization plan
contrasam Apr 11, 2026
159fc2c
fix(cluster): address three P1 code review findings
contrasam Apr 11, 2026
cb8672f
docs(29-2): record P1 code review fixes in milestone
contrasam Apr 11, 2026
7da7a6f
docs(30): plan Phase 30 — Cluster Management API (2 plans)
contrasam Apr 11, 2026
c884fd5
refactor(30-1): make cluster key-prefix constants package-private
contrasam Apr 11, 2026
8d55b8d
feat(30-1): implement ClusterConfiguration builder
contrasam Apr 11, 2026
002251a
feat(30-1): implement ClusterManagementApi interface and DefaultClust…
contrasam Apr 11, 2026
3ccf5fb
feat(30-1): wire getManagementApi() and invalidateActorAssignmentCach…
contrasam Apr 11, 2026
92d315f
test(30-1): ClusterConfiguration builder and ClusterManagementApi rea…
contrasam Apr 11, 2026
626a008
docs(30-1): complete ClusterConfiguration and ClusterManagementApi re…
contrasam Apr 11, 2026
a9b32f4
feat(30-2): implement migrateActor and drainNode in DefaultClusterMan…
contrasam Apr 11, 2026
2a0ad24
refactor(30-2): extract InMemoryMetadataStore to shared test helper
contrasam Apr 11, 2026
3d91289
fix(30-2): add shutdownLocalOnly to prevent migration from overwritin…
contrasam Apr 11, 2026
24625cd
test(30-2): ClusterManagementApi migrate and drain tests
contrasam Apr 11, 2026
c542ed0
docs(30-2): complete ClusterManagementApi migrateActor and drainNode …
contrasam Apr 11, 2026
0151563
docs(31): plan Phase 31 — Testing, Documentation & Examples (2 plans)
contrasam Apr 11, 2026
989a8f8
refactor(31-1): extract WatchableInMemoryMetadataStore and InMemoryMe…
contrasam Apr 11, 2026
5928f0c
test(31-1): chaos tests for sequential node failures and cluster reco…
contrasam Apr 11, 2026
3f6c3ac
test(31-1): cluster lifecycle tests for planned drain and graceful sh…
contrasam Apr 11, 2026
fe1a521
fix(31-1): preserve actor metadata on system stop; fix acquireLock race
contrasam Apr 11, 2026
17b1ccf
fix(31-1): atomic acquireLock in WatchableInMemoryMetadataStore; publ…
contrasam Apr 11, 2026
ca9fcb0
docs(31-1): complete cluster integration tests plan
contrasam Apr 11, 2026
26e2eb8
docs(31-2): rewrite cluster_mode.md with Phase 27-30 enhancements
contrasam Apr 11, 2026
022d751
docs(31-2): rewrite cluster_mode.md with Phase 27-30 enhancements
contrasam Apr 11, 2026
5be33e5
docs(31-2): create cluster-deployment.md production deployment guide
contrasam Apr 11, 2026
3019572
docs(31-2): create cluster-serialization.md migration guide
contrasam Apr 11, 2026
ab6be1c
test(31-2): add 100-message state recovery test to StatefulActorClust…
contrasam Apr 11, 2026
afa6809
feat(31-2): add ClusterStatefulRecoveryExample runnable demo
contrasam Apr 11, 2026
359d269
docs(31-2): complete documentation and examples plan
contrasam Apr 11, 2026
24e8b6a
docs(31-2): mark Phase 31 complete in ROADMAP
contrasam Apr 11, 2026
139 changes: 139 additions & 0 deletions .planning/ROADMAP.md
@@ -122,3 +122,142 @@ Roux upgraded to 0.2.1, AutoCloseable fix, concurrency/resource/timeout examples
## ~~Milestone 4: Doc Audit & v0.7.0 Release~~ ✅ `v0.7.0`

Doc audit (25 files), README rewrite, 4 legacy effect docs archived, version bumped 0.4.0→0.7.0, 629 tests green. → [Archive](.planning/milestones/v0.7.0-ROADMAP.md)

---

---

## Milestone 5: Cluster Evaluation & Enhancement `v0.8.0`

**Phases**: 22–31 | **Branch**: `feature/roux-effect-integration`

Evaluate and harden the cluster module. Introduce a pluggable serialization layer to replace Java native serialization throughout messaging and persistence. Add Redis-backed shared persistence so StatefulActors survive node reassignment. Harden cluster reliability, observability, and operations.

---

### ~~Phase 22: Cluster & Persistence Audit~~ ✅

**Goal**: Written findings document covering gaps, risks, and test coverage blind spots in both the cluster module and the persistence system.

Plans:
- 22.1 Audit `ClusterActorSystem`, `ReliableMessagingSystem`, `MessageTracker`, `EtcdMetadataStore`, `RendezvousHashing` — document design gaps and risks
- 22.2 Audit `StatefulActor` recovery + `PersistenceProvider` impls — demonstrate the state-loss-on-reassignment bug with a failing test
- 22.3 Audit existing cluster test coverage — identify blind spots (node failure, message ordering, split-brain)
- 22.4 Produce findings document summarising all issues, prioritised for the phases ahead

---

### ~~Phase 23: Serialization Framework~~ ✅

**Goal**: Pluggable `SerializationProvider` interface with Kryo and JSON implementations, wired into `ReliableMessagingSystem`, `FileMessageJournal`, and `LmdbMessageJournal`. Message types are no longer required to implement `Serializable`.

Plans:
- 23.1 Evaluate Kryo vs Jackson vs protobuf — pick Kryo as primary (performance, no-schema) and Jackson JSON as secondary (debuggability)
- 23.2 Design and implement `SerializationProvider` interface: `byte[] serialize(Object)` / `<T> T deserialize(byte[], Class<T>)`
- 23.3 Implement `KryoSerializationProvider` and `JsonSerializationProvider`
- 23.4 Wire `SerializationProvider` into `ReliableMessagingSystem` (inter-node messages) and `FileMessageJournal` / `LmdbMessageJournal` (persistence)
- 23.5 Tests: round-trip serialization, cross-version compatibility, verify `Serializable` constraint removed from message types
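
The interface shape in plan 23.2 can be sketched as follows. Only the two method signatures come from the plan; `JdkSerializationSketch` is a hypothetical JDK-only stand-in used so the example runs without Kryo or Jackson on the classpath, and it is not the project's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Method signatures as listed in plan 23.2; everything else is illustrative.
interface SerializationProvider {
    byte[] serialize(Object value);

    <T> T deserialize(byte[] bytes, Class<T> type);
}

// Hypothetical provider backed by JDK object streams (assumption, not Cajun code).
final class JdkSerializationSketch implements SerializationProvider {
    @Override
    public byte[] serialize(Object value) {
        try (ByteArrayOutputStream buffer = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
            out.flush(); // push buffered object data into the byte array
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public <T> T deserialize(byte[] bytes, Class<T> type) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return type.cast(in.readObject()); // fails fast on a type mismatch
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Unknown class in payload", e);
        }
    }
}

final class SerializationSketchDemo {
    public static void main(String[] args) {
        SerializationProvider provider = new JdkSerializationSketch();
        byte[] bytes = provider.serialize("hello");
        System.out.println(provider.deserialize(bytes, String.class));
    }
}
```

Because callers depend only on the interface, a Kryo- or JSON-backed provider could be swapped in without touching the messaging or journal code, which is the point of the phase.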

---

### ~~Phase 24: Redis Persistence Design~~ ✅

**Goal**: Documented schema design for Redis-backed journal and snapshot stores, with evaluated tradeoffs vs existing file and LMDB providers.

Plans:
- 24.1 Evaluate Redis data structures for journal (Streams vs Lists vs Sorted Sets) and snapshot (Hash vs String vs JSON)
- 24.2 Design key namespace: `cajun:journal:{actorId}` and `cajun:snapshot:{actorId}`
- 24.3 Evaluate Redis persistence modes (RDB, AOF, no persistence) and their impact on actor durability guarantees
- 24.4 Document tradeoffs: Redis vs LMDB vs file — latency, throughput, cross-node access, operational complexity
- 24.5 Choose Redis client library (Lettuce vs Jedis) and add dependency to `cajun-persistence/build.gradle`

---

### ~~Phase 25: Redis Persistence Provider~~ ✅

**Goal**: `RedisPersistenceProvider`, `RedisMessageJournal`, and `RedisSnapshotStore` implemented, tested, and registered in `PersistenceProviderRegistry`. Uses `SerializationProvider` from Phase 23.

Plans:
- 25.1 Implement `RedisMessageJournal<M>`: `append`, `readFrom`, `truncateBefore`, `getHighestSequenceNumber` using Redis Streams
- 25.2 Implement `RedisSnapshotStore<S>`: `saveSnapshot`, `getLatestSnapshot`, `deleteSnapshots` using Redis Hash/String
- 25.3 Implement `RedisPersistenceProvider` factory; register as `"redis"` in `PersistenceProviderRegistry`
- 25.4 Unit tests: round-trip journal append/read, snapshot save/load, truncation, sequence number tracking
- 25.5 Integration tests: `StatefulActor` with Redis provider — full journal replay and snapshot recovery
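
The four journal operations named in plan 25.1 can be illustrated with an in-memory stand-in. This is a sketch of the contract only: the class name, the synchronous signatures, the single-actor scope, and sequence numbers starting at 1 are all assumptions, and no Redis calls are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory journal for one actor (assumption, not the Redis implementation).
final class InMemoryJournalSketch<M> {
    private final ConcurrentSkipListMap<Long, M> entries = new ConcurrentSkipListMap<>();
    // Kept separate from the entries so truncation never resets the sequence.
    private final AtomicLong lastSeq = new AtomicLong(0);

    // Appends a message and returns the sequence number assigned to it.
    long append(M message) {
        long seq = lastSeq.incrementAndGet();
        entries.put(seq, message);
        return seq;
    }

    // Returns all messages with sequence number >= fromSeq, in order.
    List<M> readFrom(long fromSeq) {
        return new ArrayList<>(entries.tailMap(fromSeq, true).values());
    }

    // Drops all messages with sequence number < seq (for example after a snapshot).
    void truncateBefore(long seq) {
        entries.headMap(seq, false).clear();
    }

    // Highest sequence number ever assigned; 0 if nothing was appended.
    long getHighestSequenceNumber() {
        return lastSeq.get();
    }
}

final class JournalSketchDemo {
    public static void main(String[] args) {
        InMemoryJournalSketch<String> journal = new InMemoryJournalSketch<>();
        journal.append("a");
        journal.append("b");
        journal.append("c");
        journal.truncateBefore(3);
        System.out.println(journal.readFrom(1) + " highest=" + journal.getHighestSequenceNumber());
    }
}
```

Tracking the highest sequence number independently of the stored entries means recovery can still continue numbering correctly after old entries have been truncated.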

---

### ~~Phase 26: Cluster + Shared Persistence Integration~~ ✅

**Goal**: `ClusterActorSystem` uses Redis-backed persistence so StatefulActors recover their full state when reassigned to a new node. `PidRehydrator` verified cross-node. Benchmark Redis vs LMDB vs file.

Plans:
- 26.1 Wire `RedisPersistenceProvider` as the default persistence provider in `ClusterActorSystem` startup
- 26.2 Add persistence health check to cluster startup — fail fast if Redis unreachable
- 26.3 Verify `PidRehydrator` correctly resolves `Pid` references across nodes after actor migration
- 26.4 Integration test: actor on node A accumulates state → node A killed → actor reassigned to node B → verify full state recovered from Redis
- 26.5 Benchmark: message throughput and recovery latency for Redis vs LMDB vs file-based persistence

---

### ~~Phase 27: Observability & Diagnostics~~ ✅

**Goal**: Metrics API exposing throughput, latency, and cluster health; structured logging throughout cluster and persistence code.

Plans:
- 27.1 Define `ClusterMetrics` API: messages routed (local vs remote), routing latency, node count, actor count per node
- 27.2 Define `PersistenceMetrics` API: journal append latency, snapshot save/load latency, provider health status
- 27.3 Implement metrics collection in `ClusterActorSystem` and `ReliableMessagingSystem`
- 27.4 Add cluster health check: `ClusterActorSystem.healthCheck()` returning node liveness, leader status, persistence reachability
- 27.5 Improve structured logging (MDC or structured log fields) throughout cluster and persistence paths

---

### ~~Phase 28: Reliability Hardening~~ ✅

**Goal**: Circuit breaker for node-to-node calls; exponential backoff for metadata store operations; graceful degradation when etcd is unavailable.

Plans:
- 28.1 Implement circuit breaker for `ReliableMessagingSystem` — open on repeated node failures, half-open probe, close on recovery
- 28.2 Add exponential backoff with jitter to `EtcdMetadataStore` retry logic for transient failures
- 28.3 Graceful degradation: define behaviour when etcd is unreachable (continue routing with cached assignments vs fail-fast)
- 28.4 Improve error messages throughout cluster code — include node IDs, actor IDs, and failure context
- 28.5 Tests: circuit breaker state transitions, retry backoff behaviour, degraded-mode routing
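
Plan 28.2 calls for exponential backoff with jitter. A minimal sketch of that technique follows, using "full jitter" (a uniformly random delay between 0 and the exponential cap). The class name, constructor parameters, and the choice of full jitter are assumptions; the real `ExponentialBackoff` class may differ.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical retry helper (assumption, not the project's ExponentialBackoff).
final class ExponentialBackoffSketch {
    private final long baseDelayMs;
    private final long maxDelayMs;
    private final int maxAttempts;

    ExponentialBackoffSketch(long baseDelayMs, long maxDelayMs, int maxAttempts) {
        if (baseDelayMs <= 0 || maxDelayMs < baseDelayMs || maxAttempts < 1) {
            throw new IllegalArgumentException("invalid backoff settings");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.maxAttempts = maxAttempts;
    }

    // Upper bound for the delay after the given 0-based attempt: base * 2^attempt, capped.
    long capMs(int attempt) {
        long delay = baseDelayMs;
        for (int i = 0; i < attempt && delay < maxDelayMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    // Full jitter: a uniformly random delay in [0, cap] spreads out retry storms.
    long jitteredDelayMs(int attempt) {
        return ThreadLocalRandom.current().nextLong(capMs(attempt) + 1);
    }

    // Retries the operation until it succeeds or attempts run out.
    // Only safe for idempotent operations, since a failed call may have taken effect.
    <T> T call(Callable<T> idempotentOp) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return idempotentOp.call();
            } catch (Exception e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(jitteredDelayMs(attempt));
                }
            }
        }
        throw last; // non-null because maxAttempts >= 1
    }
}

final class BackoffSketchDemo {
    public static void main(String[] args) throws Exception {
        ExponentialBackoffSketch backoff = new ExponentialBackoffSketch(10, 200, 5);
        int[] calls = {0};
        String result = backoff.call(() -> {
            if (++calls[0] < 3) throw new java.io.IOException("transient failure");
            return "ok after " + calls[0] + " calls";
        });
        System.out.println(result);
    }
}
```

The idempotency caveat is why a lock acquisition would be a poor fit for this wrapper: retrying after an ambiguous failure could acquire the lock twice.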

---

### Phase 29: Performance Optimization

**Goal**: Local actor-location cache reduces etcd round-trips; `EtcdMetadataStore` uses connection pooling; actor registration is batched on startup.

Plans:
- 29.1 Implement local actor-location cache in `ClusterActorSystem` with TTL-based invalidation and watcher-driven eviction
- 29.2 Add connection pooling to `EtcdMetadataStore` (configure pool size, idle timeout)
- 29.3 Batch actor registration on node startup — single bulk write instead of one etcd put per actor
- 29.4 Profile hot paths under load — identify any remaining bottlenecks in routing or messaging
- 29.5 Benchmarks: routing throughput and latency with and without caching; registration time with batch vs sequential
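
Plan 29.1 describes a cache with TTL-based invalidation plus watcher-driven eviction. A minimal sketch of that idea follows. `TtlCacheSketch`, its constructor, and the injectable clock are assumptions made for illustration; the project's `TtlCache` may look different.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical TTL cache (assumption, not the project's TtlCache).
final class TtlCacheSketch<K, V> {
    private record Entry<T>(T value, long expiresAtMs) {}

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final LongSupplier clockMs; // injectable so tests can control time

    TtlCacheSketch(long ttlMs, LongSupplier clockMs) {
        this.ttlMs = ttlMs;
        this.clockMs = clockMs;
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value, clockMs.getAsLong() + ttlMs));
    }

    // Returns the cached value unless it is missing or its TTL has elapsed.
    Optional<V> get(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return Optional.empty();
        }
        if (clockMs.getAsLong() >= entry.expiresAtMs()) {
            entries.remove(key, entry); // only removes if no newer entry replaced it
            return Optional.empty();
        }
        return Optional.ofNullable(entry.value());
    }

    // Explicit eviction, e.g. called from a metadata-store watch callback.
    void invalidate(K key) {
        entries.remove(key);
    }
}

final class TtlCacheSketchDemo {
    public static void main(String[] args) {
        TtlCacheSketch<String, String> cache = new TtlCacheSketch<>(5_000L, System::currentTimeMillis);
        cache.put("actor-1", "node-a");
        System.out.println(cache.get("actor-1").orElse("miss"));
        cache.invalidate("actor-1");
        System.out.println(cache.get("actor-1").orElse("miss"));
    }
}
```

The TTL bounds how stale an assignment can get if a watch event is missed, while `invalidate` lets a watcher evict an entry as soon as a reassignment is observed.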

---

### Phase 30: Cluster Management API

**Goal**: Programmatic API for cluster operations — node listing, actor migration, node draining. Config builder for simpler cluster setup.

Plans:
- 30.1 Implement `ClusterManagementApi`: `listNodes()`, `listActors(nodeId)`, `migrateActor(actorId, targetNodeId)`, `drainNode(nodeId)`
- 30.2 Implement `ClusterConfiguration` builder — fluent API replacing raw constructor args for `ClusterActorSystem`
- 30.3 Node drain: stop accepting new actors, migrate existing actors gracefully, signal completion
- 30.4 Actor migration: move actor assignment in metadata store, trigger recovery on target node, verify state continuity
- 30.5 Tests: drain-and-rejoin cycle, forced migration with state verification
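
The drain sequence in plan 30.3 (stop accepting new actors, migrate existing ones, signal completion) can be sketched against a plain in-memory assignment map. Everything here is hypothetical: the class name, the round-robin choice of target node, and the omission of state recovery on the target are simplifications, not the behaviour of `DefaultClusterManagementApi`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of node draining (assumption, not Cajun code).
final class DrainSketch {
    private final Map<String, String> actorToNode = new TreeMap<>();
    private final Set<String> nodes = new TreeSet<>();
    private final Set<String> draining = new HashSet<>();

    void addNode(String nodeId) {
        nodes.add(nodeId);
    }

    // Assigns an actor to a node; draining nodes reject new actors.
    void assign(String actorId, String nodeId) {
        if (!nodes.contains(nodeId)) {
            throw new IllegalArgumentException("unknown node: " + nodeId);
        }
        if (draining.contains(nodeId)) {
            throw new IllegalStateException("node is draining: " + nodeId);
        }
        actorToNode.put(actorId, nodeId);
    }

    List<String> listActors(String nodeId) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : actorToNode.entrySet()) {
            if (e.getValue().equals(nodeId)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // A real migration would also trigger state recovery on the target node.
    void migrateActor(String actorId, String targetNodeId) {
        assign(actorId, targetNodeId);
    }

    // Marks the node as draining, moves its actors elsewhere, returns the moved actor IDs.
    List<String> drainNode(String nodeId) {
        draining.add(nodeId);
        List<String> targets = new ArrayList<>();
        for (String candidate : nodes) {
            if (!draining.contains(candidate)) {
                targets.add(candidate);
            }
        }
        List<String> moved = listActors(nodeId);
        if (!moved.isEmpty() && targets.isEmpty()) {
            throw new IllegalStateException("no node available to receive actors from " + nodeId);
        }
        int next = 0;
        for (String actorId : moved) {
            migrateActor(actorId, targets.get(next++ % targets.size())); // round-robin (assumption)
        }
        return moved; // returning signals that the drain is complete
    }
}

final class DrainSketchDemo {
    public static void main(String[] args) {
        DrainSketch cluster = new DrainSketch();
        cluster.addNode("node-a");
        cluster.addNode("node-b");
        cluster.assign("counter-1", "node-a");
        System.out.println("moved: " + cluster.drainNode("node-a"));
        System.out.println("node-b now hosts: " + cluster.listActors("node-b"));
    }
}
```

Marking the node as draining before moving anything prevents a concurrent placement from landing on the node halfway through the drain.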

---

### Phase 31: Testing, Documentation & Examples

**Goal**: Chaos tests, cross-node state recovery tests, serialization migration guide, and production deployment guide.

Plans:
- 31.1 Chaos tests: kill random nodes under load, verify actors recover state from Redis, verify message delivery guarantees hold
- 31.2 Cross-node state recovery test: stateful actor accumulates state across 100 messages → node killed → recovered on new node → state verified
- 31.3 Serialization migration guide: how to move from Java native serialization to Kryo/JSON for existing actors
- 31.4 Production deployment guide: etcd setup, Redis setup, cluster configuration, persistence configuration, observability
- 31.5 Runnable cluster example: two-node setup with StatefulActor, Redis persistence, node failure and recovery demonstrated
53 changes: 48 additions & 5 deletions .planning/STATE.md
@@ -1,13 +1,28 @@
# Project State

## Current Status
**Milestone**: 4 - Doc Audit & v0.7.0 Release
**Phase**: 21 ✅ Complete — Milestone 4 archived as v0.7.0
**Status**: Milestone complete — 629 tests, 0 failures, all audits clean
**Milestone**: 5 - Cluster Evaluation & Enhancement
**Phase**: 29 — Planned
**Status**: Phase 29 plan written (29-1-PLAN.md) — ready to execute
**Branch**: feature/roux-effect-integration
**Last Updated**: 2026-04-01
**Last Updated**: 2026-04-11

## Milestone 4 Phase Progress
## Milestone 5 Phase Progress

| Phase | Name | Status |
|-------|------|--------|
| 22 | Cluster & Persistence Audit | ✅ Complete |
| 23 | Serialization Framework | ✅ Complete |
| 24 | Redis Persistence Design | ✅ Complete |
| 25 | Redis Persistence Provider | ✅ Complete |
| 26 | Cluster + Shared Persistence Integration | ✅ Complete |
| 27 | Observability & Diagnostics | ✅ Complete |
| 28 | Reliability Hardening | ✅ Complete |
| 29 | Performance Optimization | 📋 Planned |
| 30 | Cluster Management API | 🔲 Not started |
| 31 | Testing, Documentation & Examples | 🔲 Not started |

## Milestone 4 Phase Progress (archived — v0.7.0)

| Phase | Name | Status |
|-------|------|--------|
@@ -102,6 +117,34 @@
- `Effect.generate()` requires `.widen()` on the handler — pass `handler.widen()`, not the raw handler
- `CapabilityHandler.compose(h1, h2, h3)` accepts raw unwidened handlers; returns `CapabilityHandler<Capability<?>>`

## Decisions Made (Milestone 5 — Phase 28)
- `NodeCircuitBreaker` is per-node (not per-actor): one node failure blocks all messages to that node; defaults are `failureThreshold=5` and `resetTimeoutMs=30000` (30 s)
- Circuit breaker implemented with `synchronized` + `volatile` — simpler than lock-free for low-contention send path
- `ExponentialBackoff` wraps only idempotent etcd ops (`put/get/delete/listKeys`); `acquireLock` excluded (double-acquire risk); watch/connect/close excluded
- Graceful degradation via `exceptionally()` on `metadataStore.get()` future — zero overhead on happy path; WARN on cache hit, ERROR on cache miss (message dropped)
- `DegradedRoutingTest` key prefix: `ClusterActorSystem.ACTOR_ASSIGNMENT_PREFIX` is `"cajun/actor/"` — test corrected to match
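
The per-node breaker described in the decisions above can be sketched as a small state machine: closed, open after repeated failures, and half-open for a single probe once the reset timeout elapses. The class below is a hypothetical illustration that follows the stated choices (`synchronized` transitions, a `volatile` state field, threshold 5, 30-second reset); the method names and the injectable clock are assumptions, not the real `NodeCircuitBreaker` API.

```java
import java.util.function.LongSupplier;

// Hypothetical circuit breaker (assumption, not the project's NodeCircuitBreaker).
final class NodeCircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long resetTimeoutMs;
    private final LongSupplier clockMs; // injectable so tests can control time

    private volatile State state = State.CLOSED; // volatile: readable without the lock
    private int consecutiveFailures;             // guarded by 'this'
    private long openedAtMs;                     // guarded by 'this'

    NodeCircuitBreakerSketch(int failureThreshold, long resetTimeoutMs, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutMs = resetTimeoutMs;
        this.clockMs = clockMs;
    }

    static NodeCircuitBreakerSketch withDefaults() {
        return new NodeCircuitBreakerSketch(5, 30_000L, System::currentTimeMillis);
    }

    State state() {
        return state;
    }

    // Returns true if a send to this node may proceed.
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clockMs.getAsLong() - openedAtMs >= resetTimeoutMs) {
            state = State.HALF_OPEN; // let exactly one probe through
            return true;
        }
        return state == State.CLOSED;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure() {
        if (state == State.OPEN) {
            return; // late failures from in-flight sends do not extend the timeout
        }
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMs = clockMs.getAsLong();
        }
    }
}

final class CircuitBreakerSketchDemo {
    public static void main(String[] args) {
        NodeCircuitBreakerSketch breaker = NodeCircuitBreakerSketch.withDefaults();
        for (int i = 0; i < 5; i++) {
            breaker.recordFailure();
        }
        System.out.println(breaker.state() + " allow=" + breaker.allowRequest());
    }
}
```

A caller would check `allowRequest()` before each send and report the outcome afterwards; a rejected send is where an exception such as `CircuitBreakerOpenException` would be raised.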

## Decisions Made (Milestone 5 — Phase 27)
- `ClusterMetrics` and `PersistenceMetrics` placed in `cajun-core/src/main/java/com/cajunsystems/metrics/` — `ReliableMessagingSystem` is in `cajun-core` so metrics must be co-located
- `ClusterMetrics` injected into `ReliableMessagingSystem` via optional setter `setClusterMetrics()` with null guards — two copies of `ReliableMessagingSystem` exist (cajun-core + lib), both updated
- `ClusterHealthStatus` record: `healthy = persistenceHealthy && messagingSystemRunning`; `persistenceHealthy=true` when no provider configured (backward compat)
- MDC cleared via try-finally in `doSendMessage()` and `handleClient()` — prevents leakage on exception
- `logback.xml` pattern: `[%X{actorId}][%X{messageId}]` added — empty strings for non-cluster log lines
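
The health rule stated above (`healthy = persistenceHealthy && messagingSystemRunning`, with persistence treated as healthy when no provider is configured) can be sketched as a record. The record name, the `of` factory, and the use of a nullable `Boolean` to model "no provider configured" are assumptions; the real `ClusterHealthStatus` may carry more fields.

```java
// Hypothetical health record (assumption, not the project's ClusterHealthStatus).
record ClusterHealthStatusSketch(boolean persistenceHealthy, boolean messagingSystemRunning) {

    // Overall health requires both subsystems to be fine.
    boolean healthy() {
        return persistenceHealthy && messagingSystemRunning;
    }

    // providerReachable == null models "no persistence provider configured",
    // which counts as healthy for backward compatibility.
    static ClusterHealthStatusSketch of(Boolean providerReachable, boolean messagingSystemRunning) {
        boolean persistenceHealthy = providerReachable == null || providerReachable;
        return new ClusterHealthStatusSketch(persistenceHealthy, messagingSystemRunning);
    }
}

final class HealthSketchDemo {
    public static void main(String[] args) {
        System.out.println(ClusterHealthStatusSketch.of(null, true).healthy());
        System.out.println(ClusterHealthStatusSketch.of(false, true).healthy());
    }
}
```

Treating "no provider" as healthy keeps existing clusters that never configured persistence from reporting as unhealthy after the upgrade.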

## Decisions Made (Milestone 5 — Phase 26)
- `ClusterActorSystem.withPersistenceProvider(PersistenceProvider)` fluent setter; `setupPersistence()` called in `start()` before heartbeat/leader election — no-op if null
- Persistence health check at startup: WARN log (not fail-fast) to preserve backward compat
- `StatefulActorClusterStateTest`: @Disabled removed, `@Tag("requires-redis")` added, shared `RedisPersistenceProvider` used for both nodes — test now asserts count=6
- Original bug-doc test kept `@Disabled` at method level as historical documentation
- `PersistenceBenchmarkTest`: `@Tag("performance")` only; Redis tests also `@Tag("requires-redis")`; N=500 messages; no latency SLA assertions

## Decisions Made (Milestone 5 — Phase 25)
- Redis journal key: `{prefix}:journal:{actorId}` (actorId wrapped in `{}` so it acts as a Redis Cluster hash tag and an actor's keys land in the same slot); seq counter: `{prefix}:journal:{actorId}:seq`
- Redis snapshot key: `{prefix}:snapshot:{actorId}` — single key per actor, overwrite semantics
- `RedisPersistenceProvider` defaults to `JavaSerializationProvider`; integration tests use `KryoSerializationProvider` explicitly
- Mocking Lettuce `RedisFuture` in tests: use concrete anonymous `RedisFuture` implementation wrapping `CompletableFuture` — avoids Mockito strict-stubbing issues with default `CompletionStage` interface methods
- Integration tests tagged `@Tag("requires-redis")` — excluded from default Gradle test task in both `cajun-persistence` and `lib`
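
The key layout decided above can be sketched as a small helper. It assumes that `{prefix}` is a placeholder (the example uses `cajun`), that the braces around the actor ID are literal, and that actor IDs contain no braces themselves; `RedisKeysSketch` and `hashTag` are hypothetical names. The `hashTag` method mirrors the Redis Cluster rule that only the text between the first `{` and the next `}` is hashed when that text is non-empty.

```java
// Hypothetical key builder (assumption, not the project's Redis classes).
final class RedisKeysSketch {
    private final String prefix;

    RedisKeysSketch(String prefix) {
        this.prefix = prefix;
    }

    String journal(String actorId) {
        return prefix + ":journal:{" + actorId + "}";
    }

    String journalSeq(String actorId) {
        return journal(actorId) + ":seq";
    }

    String snapshot(String actorId) {
        return prefix + ":snapshot:{" + actorId + "}";
    }

    // Portion of the key that Redis Cluster hashes to choose a slot.
    static String hashTag(String key) {
        int open = key.indexOf('{');
        if (open < 0) {
            return key; // no tag: the whole key is hashed
        }
        int close = key.indexOf('}', open + 1);
        if (close < 0 || close == open + 1) {
            return key; // unterminated or empty tag: the whole key is hashed
        }
        return key.substring(open + 1, close);
    }
}

final class RedisKeysSketchDemo {
    public static void main(String[] args) {
        RedisKeysSketch keys = new RedisKeysSketch("cajun");
        System.out.println(keys.journal("order-17"));
        System.out.println(keys.journalSeq("order-17"));
        System.out.println(keys.snapshot("order-17"));
    }
}
```

Since all three keys for an actor share the same hash tag, they map to the same slot, which is what allows a single Lua script to touch the journal and its sequence counter atomically in a clustered deployment.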

## Decisions Made (Milestone 2)
- Audience: Cajun library users (self-contained, easy to run)
- Stateful approach: ask-pattern composition, not AtomicReference-inside-effect