[WIP] Cluster modules improvements by contrasam · Pull Request #34 · CajunSystems/cajun

contrasam · 2026-04-11T08:52:42Z

No description provided.

10 phases (22–31): cluster + persistence audit, pluggable serialization (Kryo/JSON replacing Java native serialization), Redis shared persistence for cross-node StatefulActor recovery, cluster observability, reliability hardening, performance optimization, management API, and final validation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@disabled

Documents the bug where StatefulActor state is lost when a cluster node fails and the actor is reassigned to another node. The test is @disabled until Phase 26 (Cluster + Shared Persistence Integration) fixes this. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Critical: StatefulActor state lost on cluster reassignment (no shared persistence). High: Java native serialization for inter-node messages, MessageTracker resource leak, no retry/backoff in EtcdMetadataStore. Findings mapped to Phases 23-31. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Phase 22 complete. Findings document covers 11 issues (2 critical, 4 high, 5 medium) across cluster and persistence modules, prioritised for Phases 23-31. State-loss-on-reassignment bug documented with regression test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…plementations Defines pluggable byte[] serialize/deserialize contract in cajun-core. JavaSerializationProvider is the backward-compat fallback. KryoSerializationProvider (primary) uses ThreadLocal Kryo with registration-not-required + objenesis. JsonSerializationProvider (secondary) uses Jackson with full type info. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces ObjectOutputStream/ObjectInputStream with provider.serialize/deserialize. Length-prefixed framing (4-byte int + payload) used for stream-based transport. RemoteMessage and MessageAcknowledgment no longer require Serializable. lib copy defaults to KryoSerializationProvider; cajun-core copy defaults to JavaSerializationProvider. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…leSnapshotStore Replaces ObjectOutputStream/ObjectInputStream with provider.serialize/deserialize. Journal and snapshot entries written as raw byte files. Default provider is JavaSerializationProvider for backward compatibility with existing journal files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…t store Removes T extends Serializable constraint from LmdbMessageJournal, LmdbSnapshotStore, and LmdbBatchedMessageJournal. serializeEntry/deserializeEntry now delegate to SerializationProvider. Default provider is JavaSerializationProvider for backward compat. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…d Java providers Tests cover: non-Serializable records via Kryo, sealed interfaces, JournalEntry generics, concurrent Kryo isolation, FileMessageJournal integration (Kryo + Java backward-compat). Non-Serializable types verified to fail with Java provider. Kryo provider now uses Objenesis StdInstantiatorStrategy for classes without no-arg constructors (Java records, final classes). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Phase 23 complete. SerializationProvider interface + Kryo/JSON/Java implementations wired into ReliableMessagingSystem, FileMessageJournal, and LmdbMessageJournal. Serializable constraint removed from LMDB journals. Tests green (15/15). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Documents Redis key namespace (Hash for journal, String for snapshot), command mapping for all MessageJournal and SnapshotStore operations, Redis Cluster hash tag co-location strategy, AOF persistence mode guidance, tradeoff comparison vs LMDB and file, Lettuce async API design notes, and 6 known limitations/risks. Basis for Phase 25 implementation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e and lib Adds io.lettuce:lettuce-core:6.3.2.RELEASE as the Redis client library. Lettuce chosen over Jedis for async-first API (CompletableFuture native) compatible with MessageJournal/SnapshotStore async contract. No implementation yet — RedisPersistenceProvider implemented in Phase 25. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Phase 24 complete. Schema design covers Hash-based journal, String-based snapshot, key namespace with Cluster hash tags, AOF durability guidance, Redis vs LMDB vs file tradeoffs, and Lettuce async API patterns. Lettuce 6.3.2.RELEASE added to cajun-persistence and lib. All tests green. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Implements MessageJournal<M> on top of Redis using Lettuce. Uses a Lua script for atomic sequence-counter increment + HSET in a single round-trip. readFrom filters and sorts hash fields by sequence number, truncateBefore deletes fields strictly below the threshold, and getHighestSequenceNumber reads the counter key returning -1 for absent keys. Actor IDs with colons are sanitized to underscores; hash-tag notation forces Redis Cluster co-location. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…mantics Implements SnapshotStore<S> backed by Redis. Each actor stores exactly one snapshot (latest wins) at key {prefix}:snapshot:{sanitized(actorId)}. The full SnapshotEntry (actorId + state + sequenceNumber) is serialized and stored as a Redis String. isHealthy() uses a synchronous PING check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s tag RedisPersistenceProvider creates a single shared Lettuce connection with StringCodec/ByteArrayCodec and provides factory methods for RedisMessageJournal and RedisSnapshotStore. Batched journals are wrapped in SimpleBatchedMessageJournal. The 'requires-redis' JUnit tag is added to excludeTags in both cajun-persistence and lib test tasks so integration tests are not run without a Redis instance. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tuce - RedisMessageJournalTest: 8 tests covering append, readFrom, truncateBefore, getHighestSequenceNumber, and actorId sanitization using mocked async commands - RedisSnapshotStoreTest: 4 tests covering saveSnapshot, getLatestSnapshot, deleteSnapshots using mocked Lettuce connection - Use concrete RedisFuture anonymous impl to avoid Mockito strict-stubbing issues with default interface methods (toCompletableFuture) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@tag

…ulActor recovery - RedisIntegrationTest: 7 tests covering journal round-trip, truncation, highest sequence number, snapshot save/load/delete/overwrite, and health check - RedisStatefulActorTest: verifies Phase 22 C1 bug fix — actor recovers state from Redis after full ActorSystem restart, then continues processing correctly - Both test classes tagged @tag("requires-redis") and excluded from default test run Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…startup health check Registers the given provider in PersistenceProviderRegistry and sets it as the default during start(), so all StatefulActors on the node use shared persistence. Logs WARN if provider reports unhealthy at startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Verifies null-system replacement, existing-system preservation, record traversal, nested Pid rehydration, and null-state handling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@disabled

…dis cross-node Replaces MockBatchedMessageJournal/MockSnapshotStore with shared RedisPersistenceProvider so both system1 and system2 use the same Redis journal/snapshot keys. StatefulActor on system2 replays 5 messages from Redis and recovers count=5; 1 more increment → count=6. Original @disabled test kept as historical bug documentation at method level. Tagged requires-redis, UUID actor IDs prevent cross-run interference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…append and recovery Measures message throughput and recovery latency for FileMessageJournal and RedisMessageJournal over N=500 messages. Results printed to stdout. Tagged performance (+ requires-redis for Redis tests) — excluded from default runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…essagingSystem Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…lthCheck() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… update logback.xml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Double-counted remote message failures: routeToNode().exceptionally() now skips incrementRemoteMessageFailures() when messagingSystem is a ReliableMessagingSystem (doSendMessage already counts the failure) - Silent SerializationException in handleClient(): add explicit catch before IOException so deserialization errors are logged rather than swallowed by the executor's uncaught handler - Jackson RCE via DefaultTyping.EVERYTHING: replace allowIfBaseType(Object) + EVERYTHING with configurable trusted package prefixes and NON_FINAL; default INSTANCE restricts to com.cajunsystems.*, java.*, javax.* Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add 29-2-PLAN.md and 29-2-SUMMARY.md documenting the three P1 fixes (double-counted metrics, silent SerializationException, Jackson RCE). Update STATE.md decisions and ROADMAP.md Phase 29 section. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

30-1: ClusterConfiguration builder + ClusterManagementApi interface + listNodes/listActors read ops + 10 unit tests 30-2: migrateActor + drainNode implementations + drain-and-rejoin tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…erManagementApi Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e() into ClusterActorSystem Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…d-op tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ad-ops plan Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…agementApi migrateActor: validates target via metadata store (authoritative, not in-memory knownNodes cache), updates etcd assignment, invalidates TTL cache, stops local actor when this node is the source so state is persisted before lazy recovery. drainNode: queries live node list from etcd, excludes drained node, distributes actors via RendezvousHashing, runs migrations in parallel, swallows per-actor failures (best-effort drain), fails fast if no other nodes available. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Moves the ConcurrentHashMap-backed MetadataStore stub from an inner class in ClusterManagementApiReadTest to a package-private top-level class, available to all cluster test files without duplication. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…g etcd assignment ClusterActorSystem.shutdown(actorId) deletes the actor's etcd entry as part of unregistering from the cluster. This caused migrateActor to write the new target assignment then immediately delete it when stopping the local mailbox. Add shutdownLocalOnly() (package-private) that calls super.shutdown() only, and use it in DefaultClusterManagementApi.migrateActor() so the new assignment written to etcd is preserved after the local actor stops. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

4 migrate tests: updatesMetadataStore, unknownTarget, notCurrentlyAssigned, listActors reflects new assignment. 5 drain tests: migratesAll, emptyNode, singleNodeFails, consistentHashing distribution, drain-and-rejoin cycle. All use InMemoryMetadataStore + NoopMessagingSystem — no etcd required. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…plan Phase 30 fully complete. Add 30-2-SUMMARY documenting the shutdownLocalOnly deviation. Update STATE.md decisions and ROADMAP.md Phase 30 to ✅. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

31-1: Shared test helper extraction (WatchableInMemoryMetadataStore + InMemoryMessagingSystem) + 3 chaos tests + 3 lifecycle/drain tests 31-2: cluster_mode.md rewrite + cluster-deployment.md + cluster-serialization.md + 100-msg Redis state recovery test + ClusterStatefulRecoveryExample Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ssagingSystem as shared test helpers Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…very

…utdown

Two bugs fixed while writing the cluster lifecycle integration tests: 1. shutdownLocalOnly() leak: Actor.stop() always calls system.shutdown(actorId) which deleted the metadata entry — defeating the purpose of shutdownLocalOnly. Fix: track actors in skipMetadataDeleteActors; shutdown(actorId) skips the delete when the flag is set. 2. System stop deletes actor metadata: ClusterActorSystem.stop() called super.shutdown() which stopped each actor, triggering Actor.stop() → system.shutdown(actorId) → metadataStore.delete(). This erased assignments before the leader could reassign them to surviving nodes. Fix: suppress metadata deletion for all local actors during stop(); node key deletion already signals departure — individual actor metadata is left for the leader to redistribute. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ic TestActor classes

Replaces the original 141-line stub with a ~300-line document covering ClusterConfiguration builder, ClusterManagementApi, ClusterMetrics, NodeCircuitBreaker, TTL cache, Redis persistence, and serialization. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Expanded from 141 to ~300 lines. Covers cross-node state recovery (Phase 26+), ClusterManagementApi with rolling upgrade workflow, NodeCircuitBreaker, ClusterMetrics, health check, graceful degradation, and an updated production checklist. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

New ~200-line guide covering etcd cluster setup, Redis durability config, JVM startup bootstrap, health monitoring with Prometheus, rolling upgrade procedure, Kubernetes StatefulSet example, and a troubleshooting section. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

New ~150-line guide covering KryoSerializationProvider vs JsonSerializationProvider tradeoffs, security model, schema evolution with Kryo, Kryo-to-JSON migration procedure, and verification steps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…erStateTest New testStatefulActorFullLifecycle_100Messages verifies that a StatefulActor accumulating 100 Redis-journaled messages fully recovers on a new cluster node, then correctly processes additional messages (expected total: 101). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Self-contained example demonstrating stateful actor recovery across a simulated cluster failover. Uses inline SimpleMetadataStore and SimpleMessagingSystem (no real etcd/gRPC needed), RedisPersistenceProvider, 20 increments on system1, drainNode + stop, then recovery and 5 more increments on system2 verified at count=25. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Mark 31-2 as complete in PLAN.md, create 31-2-SUMMARY.md with commit hashes and deviations, update STATE.md (Phase 31 complete, decisions logged), mark 31-2 as done in ROADMAP.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

contrasam and others added 30 commits April 11, 2026 09:54

docs(25-1): complete Redis Persistence Provider plan

825776a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(26-1): unit tests for PidRehydrator in cluster/recovery context

91748dc

Verifies null-system replacement, existing-system preservation, record traversal, nested Pid rehydration, and null-state handling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(26-1): complete Cluster + Shared Persistence Integration plan

1a12433

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

plan(27-1): define Phase 27 — Observability & Diagnostics

fdc2c9b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(27-1): add ClusterMetrics — routing and node counters

583ebb7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(27-1): add PersistenceMetrics and PersistenceMetricsRegistry

a686231

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(27-1): wire ClusterMetrics into ClusterActorSystem and ReliableM…

4188e3f

…essagingSystem Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(27-1): add ClusterHealthStatus record and ClusterActorSystem.hea…

50b901f

…lthCheck() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(27-1): add MDC structured logging to ReliableMessagingSystem and…

22f71cd

… update logback.xml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

contrasam and others added 30 commits April 11, 2026 14:33

docs(29-1): complete Performance Optimization plan

4b46d85

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor(30-1): make cluster key-prefix constants package-private

c884fd5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(30-1): implement ClusterConfiguration builder

8d55b8d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(30-1): implement ClusterManagementApi interface and DefaultClust…

002251a

…erManagementApi Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(30-1): wire getManagementApi() and invalidateActorAssignmentCach…

3ccf5fb

…e() into ClusterActorSystem Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(30-1): ClusterConfiguration builder and ClusterManagementApi rea…

92d315f

…d-op tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(30-1): complete ClusterConfiguration and ClusterManagementApi re…

626a008

…ad-ops plan Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(30-2): complete ClusterManagementApi migrateActor and drainNode …

c542ed0

…plan Phase 30 fully complete. Add 30-2-SUMMARY documenting the shutdownLocalOnly deviation. Update STATE.md decisions and ROADMAP.md Phase 30 to ✅. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor(31-1): extract WatchableInMemoryMetadataStore and InMemoryMe…

989a8f8

…ssagingSystem as shared test helpers Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(31-1): chaos tests for sequential node failures and cluster reco…

5928f0c

…very

test(31-1): cluster lifecycle tests for planned drain and graceful sh…

3f6c3ac

…utdown

fix(31-1): atomic acquireLock in WatchableInMemoryMetadataStore; publ…

17b1ccf

…ic TestActor classes

docs(31-1): complete cluster integration tests plan

ca9fcb0

docs(31-2): mark Phase 31 complete in ROADMAP

24e8b6a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Cluster modules improvements#34

[WIP] Cluster modules improvements#34
contrasam wants to merge 77 commits intomainfrom
feature/cluster-improvements

contrasam commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

contrasam commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant