Open
Conversation
10 phases (22–31): cluster + persistence audit, pluggable serialization (Kryo/JSON replacing Java native serialization), Redis shared persistence for cross-node StatefulActor recovery, cluster observability, reliability hardening, performance optimization, management API, and final validation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the bug where StatefulActor state is lost when a cluster node fails and the actor is reassigned to another node. The test is @disabled until Phase 26 (Cluster + Shared Persistence Integration) fixes this. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Critical: StatefulActor state lost on cluster reassignment (no shared persistence). High: Java native serialization for inter-node messages, MessageTracker resource leak, no retry/backoff in EtcdMetadataStore. Findings mapped to Phases 23-31. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 22 complete. Findings document covers 11 issues (2 critical, 4 high, 5 medium) across cluster and persistence modules, prioritised for Phases 23-31. State-loss-on-reassignment bug documented with regression test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…plementations Defines pluggable byte[] serialize/deserialize contract in cajun-core. JavaSerializationProvider is the backward-compat fallback. KryoSerializationProvider (primary) uses ThreadLocal Kryo with registration-not-required + objenesis. JsonSerializationProvider (secondary) uses Jackson with full type info. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces ObjectOutputStream/ObjectInputStream with provider.serialize/deserialize. Length-prefixed framing (4-byte int + payload) used for stream-based transport. RemoteMessage and MessageAcknowledgment no longer require Serializable. lib copy defaults to KryoSerializationProvider; cajun-core copy defaults to JavaSerializationProvider. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…leSnapshotStore Replaces ObjectOutputStream/ObjectInputStream with provider.serialize/deserialize. Journal and snapshot entries written as raw byte files. Default provider is JavaSerializationProvider for backward compatibility with existing journal files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t store Removes T extends Serializable constraint from LmdbMessageJournal, LmdbSnapshotStore, and LmdbBatchedMessageJournal. serializeEntry/deserializeEntry now delegate to SerializationProvider. Default provider is JavaSerializationProvider for backward compat. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d Java providers Tests cover: non-Serializable records via Kryo, sealed interfaces, JournalEntry generics, concurrent Kryo isolation, FileMessageJournal integration (Kryo + Java backward-compat). Non-Serializable types verified to fail with Java provider. Kryo provider now uses Objenesis StdInstantiatorStrategy for classes without no-arg constructors (Java records, final classes). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 23 complete. SerializationProvider interface + Kryo/JSON/Java implementations wired into ReliableMessagingSystem, FileMessageJournal, and LmdbMessageJournal. Serializable constraint removed from LMDB journals. Tests green (15/15). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents Redis key namespace (Hash for journal, String for snapshot), command mapping for all MessageJournal and SnapshotStore operations, Redis Cluster hash tag co-location strategy, AOF persistence mode guidance, tradeoff comparison vs LMDB and file, Lettuce async API design notes, and 6 known limitations/risks. Basis for Phase 25 implementation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e and lib Adds io.lettuce:lettuce-core:6.3.2.RELEASE as the Redis client library. Lettuce chosen over Jedis for async-first API (CompletableFuture native) compatible with MessageJournal/SnapshotStore async contract. No implementation yet — RedisPersistenceProvider implemented in Phase 25. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 24 complete. Schema design covers Hash-based journal, String-based snapshot, key namespace with Cluster hash tags, AOF durability guidance, Redis vs LMDB vs file tradeoffs, and Lettuce async API patterns. Lettuce 6.3.2.RELEASE added to cajun-persistence and lib. All tests green. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements MessageJournal<M> on top of Redis using Lettuce. Uses a Lua script for atomic sequence-counter increment + HSET in a single round-trip. readFrom filters and sorts hash fields by sequence number, truncateBefore deletes fields strictly below the threshold, and getHighestSequenceNumber reads the counter key returning -1 for absent keys. Actor IDs with colons are sanitized to underscores; hash-tag notation forces Redis Cluster co-location. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mantics
Implements SnapshotStore<S> backed by Redis. Each actor stores exactly one
snapshot (latest wins) at key {prefix}:snapshot:{sanitized(actorId)}. The
full SnapshotEntry (actorId + state + sequenceNumber) is serialized and
stored as a Redis String. isHealthy() uses a synchronous PING check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s tag RedisPersistenceProvider creates a single shared Lettuce connection with StringCodec/ByteArrayCodec and provides factory methods for RedisMessageJournal and RedisSnapshotStore. Batched journals are wrapped in SimpleBatchedMessageJournal. The 'requires-redis' JUnit tag is added to excludeTags in both cajun-persistence and lib test tasks so integration tests are not run without a Redis instance. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tuce - RedisMessageJournalTest: 8 tests covering append, readFrom, truncateBefore, getHighestSequenceNumber, and actorId sanitization using mocked async commands - RedisSnapshotStoreTest: 4 tests covering saveSnapshot, getLatestSnapshot, deleteSnapshots using mocked Lettuce connection - Use concrete RedisFuture anonymous impl to avoid Mockito strict-stubbing issues with default interface methods (toCompletableFuture) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ulActor recovery - RedisIntegrationTest: 7 tests covering journal round-trip, truncation, highest sequence number, snapshot save/load/delete/overwrite, and health check - RedisStatefulActorTest: verifies Phase 22 C1 bug fix — actor recovers state from Redis after full ActorSystem restart, then continues processing correctly - Both test classes tagged @tag("requires-redis") and excluded from default test run Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…startup health check Registers the given provider in PersistenceProviderRegistry and sets it as the default during start(), so all StatefulActors on the node use shared persistence. Logs WARN if provider reports unhealthy at startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verifies null-system replacement, existing-system preservation, record traversal, nested Pid rehydration, and null-state handling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dis cross-node Replaces MockBatchedMessageJournal/MockSnapshotStore with shared RedisPersistenceProvider so both system1 and system2 use the same Redis journal/snapshot keys. StatefulActor on system2 replays 5 messages from Redis and recovers count=5; 1 more increment → count=6. Original @disabled test kept as historical bug documentation at method level. Tagged requires-redis, UUID actor IDs prevent cross-run interference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…append and recovery Measures message throughput and recovery latency for FileMessageJournal and RedisMessageJournal over N=500 messages. Results printed to stdout. Tagged performance (+ requires-redis for Redis tests) — excluded from default runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…essagingSystem Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lthCheck() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… update logback.xml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Double-counted remote message failures: routeToNode().exceptionally() now skips incrementRemoteMessageFailures() when messagingSystem is a ReliableMessagingSystem (doSendMessage already counts the failure) - Silent SerializationException in handleClient(): add explicit catch before IOException so deserialization errors are logged rather than swallowed by the executor's uncaught handler - Jackson RCE via DefaultTyping.EVERYTHING: replace allowIfBaseType(Object) + EVERYTHING with configurable trusted package prefixes and NON_FINAL; default INSTANCE restricts to com.cajunsystems.*, java.*, javax.* Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add 29-2-PLAN.md and 29-2-SUMMARY.md documenting the three P1 fixes (double-counted metrics, silent SerializationException, Jackson RCE). Update STATE.md decisions and ROADMAP.md Phase 29 section. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
30-1: ClusterConfiguration builder + ClusterManagementApi interface +
listNodes/listActors read ops + 10 unit tests
30-2: migrateActor + drainNode implementations + drain-and-rejoin tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erManagementApi Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e() into ClusterActorSystem Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d-op tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ad-ops plan Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…agementApi migrateActor: validates target via metadata store (authoritative, not in-memory knownNodes cache), updates etcd assignment, invalidates TTL cache, stops local actor when this node is the source so state is persisted before lazy recovery. drainNode: queries live node list from etcd, excludes drained node, distributes actors via RendezvousHashing, runs migrations in parallel, swallows per-actor failures (best-effort drain), fails fast if no other nodes available. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Moves the ConcurrentHashMap-backed MetadataStore stub from an inner class in ClusterManagementApiReadTest to a package-private top-level class, available to all cluster test files without duplication. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g etcd assignment ClusterActorSystem.shutdown(actorId) deletes the actor's etcd entry as part of unregistering from the cluster. This caused migrateActor to write the new target assignment then immediately delete it when stopping the local mailbox. Add shutdownLocalOnly() (package-private) that calls super.shutdown() only, and use it in DefaultClusterManagementApi.migrateActor() so the new assignment written to etcd is preserved after the local actor stops. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4 migrate tests: updatesMetadataStore, unknownTarget, notCurrentlyAssigned, listActors reflects new assignment. 5 drain tests: migratesAll, emptyNode, singleNodeFails, consistentHashing distribution, drain-and-rejoin cycle. All use InMemoryMetadataStore + NoopMessagingSystem — no etcd required. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…plan Phase 30 fully complete. Add 30-2-SUMMARY documenting the shutdownLocalOnly deviation. Update STATE.md decisions and ROADMAP.md Phase 30 to ✅. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
31-1: Shared test helper extraction (WatchableInMemoryMetadataStore +
InMemoryMessagingSystem) + 3 chaos tests + 3 lifecycle/drain tests
31-2: cluster_mode.md rewrite + cluster-deployment.md + cluster-serialization.md
+ 100-msg Redis state recovery test + ClusterStatefulRecoveryExample
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ssagingSystem as shared test helpers Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs fixed while writing the cluster lifecycle integration tests: 1. shutdownLocalOnly() leak: Actor.stop() always calls system.shutdown(actorId) which deleted the metadata entry — defeating the purpose of shutdownLocalOnly. Fix: track actors in skipMetadataDeleteActors; shutdown(actorId) skips the delete when the flag is set. 2. System stop deletes actor metadata: ClusterActorSystem.stop() called super.shutdown() which stopped each actor, triggering Actor.stop() → system.shutdown(actorId) → metadataStore.delete(). This erased assignments before the leader could reassign them to surviving nodes. Fix: suppress metadata deletion for all local actors during stop(); node key deletion already signals departure — individual actor metadata is left for the leader to redistribute. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ic TestActor classes
Replaces the original 141-line stub with a ~300-line document covering ClusterConfiguration builder, ClusterManagementApi, ClusterMetrics, NodeCircuitBreaker, TTL cache, Redis persistence, and serialization. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expanded from 141 to ~300 lines. Covers cross-node state recovery (Phase 26+), ClusterManagementApi with rolling upgrade workflow, NodeCircuitBreaker, ClusterMetrics, health check, graceful degradation, and an updated production checklist. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New ~200-line guide covering etcd cluster setup, Redis durability config, JVM startup bootstrap, health monitoring with Prometheus, rolling upgrade procedure, Kubernetes StatefulSet example, and a troubleshooting section. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New ~150-line guide covering KryoSerializationProvider vs JsonSerializationProvider tradeoffs, security model, schema evolution with Kryo, Kryo-to-JSON migration procedure, and verification steps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erStateTest New testStatefulActorFullLifecycle_100Messages verifies that a StatefulActor accumulating 100 Redis-journaled messages fully recovers on a new cluster node, then correctly processes additional messages (expected total: 101). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Self-contained example demonstrating stateful actor recovery across a simulated cluster failover. Uses inline SimpleMetadataStore and SimpleMessagingSystem (no real etcd/gRPC needed), RedisPersistenceProvider, 20 increments on system1, drainNode + stop, then recovery and 5 more increments on system2 verified at count=25. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mark 31-2 as complete in PLAN.md, create 31-2-SUMMARY.md with commit hashes and deviations, update STATE.md (Phase 31 complete, decisions logged), mark 31-2 as done in ROADMAP.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.