PHOENIX-7787 Make CCF HAGroupStore ZK Updates backward compatible with existing ZK based client by ritegarg · Pull Request #2479 · apache/phoenix

ritegarg · 2026-05-16T23:35:29Z

What changes were proposed in this pull request?

Adds an opt-in sync from /phoenix/consistentHA/<G> to the legacy /phoenix/ha/<G> znode in
each region server's HAGroupStoreClient. Each client derives a combined ClusterRoleRecord
(local role + peer role) and writes it to its local /phoenix/ha via ZK stat-version CAS, so
pre-consistentHA ZK-registry clients keep reading a current view of the HA group.

New config keys (in QueryServices / QueryServicesOptions):

phoenix.ha.legacy.crr.sync.enabled — master switch (default false).
phoenix.ha.legacy.crr.reconciliation.interval.seconds — periodic reconciliation cadence
(default 60; 0 disables only the periodic loop, the event-driven path still runs).

Convergence model:

Event-driven on consistentHA CHILD_ADDED / CHILD_UPDATED (LOCAL or PEER), offloaded onto
a per-group single-thread ScheduledExecutorService so Curator's per-namespace event
dispatcher never blocks on ZK or JDBC I/O.
Periodic reconciler on the same executor (0–30s jitter on first run).
Curator NodeCache watcher on the legacy znode itself, so per-sync reads are in-memory
rather than a ZK round-trip. The NodeCache is eventually consistent, so on apparent absence
the sync path falls back to one authoritative getData() to disambiguate "absent" from
"not-yet-cached" before deciding CREATE_NEW vs CAS_WITH_VERSION.
On CAS BadVersion: log at INFO and bail. The next event/periodic cycle reconverges. No
inline retry.

Code-level changes:

ClusterRoleRecord: new explicit-RegistryType ctor; @JsonCreator now round-trips
registryType (legacy znode is always written with RegistryType.ZK for backward
compatibility, and must read back as ZK). New isLogicallyEqualIgnoringVersionAndRegistry
helper used as the rewrite short-circuit.
PhoenixHAAdmin:
- Three new helpers on the legacy namespace:
  - getClusterRoleRecordAndStatInZooKeeper — atomic (data, stat) read via
    storingStatIn.
  - createOrUpdateClusterRoleRecordWithCAS(name, record, LegacyCrrWriteMode mode, int expectedStatVersion) — typed write API. LegacyCrrWriteMode is a Phoenix-internal
    enum with values CREATE_NEW, FORCE_OVERWRITE (@VisibleForTesting),
    CAS_WITH_VERSION; the int argument is meaningful only for CAS_WITH_VERSION
    (validated >= 0 at entry). Both BadVersion (CAS lost) and NodeExists (create
    lost) surface as StaleClusterRoleRecordVersionException so callers retry
    uniformly.
- Atomic-read fix on the existing getHAGroupStoreRecordInZooKeeper (replaces a
  pre-existing 2-RPC pattern that could return a stat version not matching its data
  bytes).
HAGroupStoreClient: legacyHaAdmin + legacyCrrNodeCache + dedicated periodic executor.
Listener hooks for CHILD_ADDED / CHILD_UPDATED. CHILD_REMOVED is intentionally a
no-op — the legacy /phoenix/ha znode is never deleted by this client.
New StaleClusterRoleRecordVersionException (CRR-specific analog of
StaleHAGroupStoreRecordVersionException).

Concurrency / lifecycle invariants:

All sync invocations (initial, event-driven, periodic) route through the single-thread
executor, so they are serialized naturally; no per-instance lock is held across ZK I/O.
The initial sync is submitted to the executor rather than invoked inline in the constructor,
so a fresh HAGroupStoreClient does not block getInstanceForZkUrl's static class monitor
on ZK / JDBC I/O.
syncLegacyCRRIfRoleChanged snapshots legacyHaAdmin / legacyCrrNodeCache to local
finals before touching them, so a racing close() cannot NPE an in-flight call or push
writes through a torn-down Curator client.
close() marks the instance isHealthy=false and removes it from the static instances
map up front, so a concurrent getInstanceForZkUrl cannot hand out a half-closed client.

Back-compat and divergence guards:

Refuse to overwrite a non-ZK admin-managed legacy CRR (typically RegistryType.RPC with
master-form URLs). Existing readers feed the stored URLs through
JDBCUtil.formatUrl(url, roleRecord.getRegistryType()) to build connection strings;
swapping registryType + URLs out from under them would silently change their target. The
sync skips with a WARN and requires an explicit admin migration to ZK form before
proceeding.
Wait for a non-blank peer ZK URL on the local record before building the desired CRR;
ClusterRoleRecord's ctor would otherwise NPE on null url2 via JDBCUtil.formatUrl.
When the local peer cache returns null, preserve existing.role2 in the desired CRR
rather than downgrading to UNKNOWN. UNKNOWN is strictly less information than a
previously-recorded role, and downgrading would cause a multi-RS write storm where peers
with differing visibility CAS-overwrite each other.

Why are the changes needed?

The Consistent Cluster Failover work moved HA group state from /phoenix/ha (which held a
ClusterRoleRecord) to /phoenix/consistentHA (which holds an HAGroupStoreRecord and uses
an RPC-registry view of CRRs). Existing Phoenix clients built against the older ZK-registry
contract still read /phoenix/ha/<G> and expect:

The znode to exist for any live HA group.
The record's registryType to be ZK.
The roles to reflect the latest state of both clusters.

Without this patch, after a CCF rollout the /phoenix/ha znode goes stale or absent, and old
ZK-registry clients break. This change keeps /phoenix/ha in sync with the new authoritative
/phoenix/consistentHA records on a per-RS basis, gated by an opt-in flag so it only runs
where the operator needs it.

Does this PR introduce any user-facing change?

Yes, but the new behavior is gated entirely behind the new master switch
(phoenix.ha.legacy.crr.sync.enabled), which defaults to false.

When the flag is enabled:

Each HAGroupStoreClient writes to (and maintains) the legacy /phoenix/ha/<G> znode on
its local ZK.
Old pre-consistentHA ZK-registry clients pointed at /phoenix/ha will see fresh records
again.
The flag is read once at construction; toggling it off requires a process restart. This is
documented in HAGroupStoreClient's class Javadoc.
Rollout constraint: the persisted legacy znode now carries the new registryType
field. Pre-PR clients using the default Jackson ObjectMapper
(FAIL_ON_UNKNOWN_PROPERTIES=true) will reject the new bytes. All readers should be
upgraded to a tolerant build before enabling this flag on any writer.

One incidental visible change independent of the flag: ClusterRoleRecord JSON
deserialization now respects the persisted registryType field. Previously it was hard-coded
to RPC regardless of JSON content. Existing JSON without a registryType field still
defaults to RPC, so deployments not exposed to legacy-sync writes see no behavior change.

How was this patch tested?

Integration tests in HAGroupStoreClientIT (40 cases, all passing) covering: feature-off
no-op, initial create, role-changing transition, state-only transition (no rewrite),
peer-driven role flip, peer-absent → role2=UNKNOWN → recovery, CAS error mapping +
write-mode dispatch (CREATE_NEW, FORCE_OVERWRITE, CAS_WITH_VERSION negative-version
rejection), LOCAL/PEER CHILD_REMOVED (znode preserved), registryType stays ZK across
multiple role cycles, reconciliation interval =0 (event-driven still works), periodic-loop
recovery after an external divergence, and the two halves of the peer-view contract:
peer absent → existing role2 preserved (no write fires), peer present → stale role2
reconciles to live state.
Unit tests in ClusterRoleRecordTest (21 cases, all passing) covering JSON backward-compat
(missing registryType defaults to RPC, explicit RPC round-trips, explicit ZK round-trips,
full ZK toJson/fromJson round-trip preserves registryType) and the
isLogicallyEqualIgnoringVersionAndRegistry helper (null, same-modulo-version, ZK-vs-RPC
URL form, role mismatch, name mismatch). Backed by three JSON fixtures under
phoenix-core/src/test/resources/json/.
mvn spotless:check clean across the touched modules.
Ran all ITs locally, no new failures due to this PR

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic) and Cursor (Claude).

…h existing ZK based client

ClusterRoleRecord's ctor canonicalizes url1/url2 by lexical order, so existing.getRole2() returns the role for whichever URL sorts larger - not necessarily the peer URL. When the local peer cache is empty and we were trying to preserve existing peer role, we'd half the time inherit the local role instead, then the equality check would see a "logical change" and trigger a redundant CAS write. Manifests as flakes (deployment-dependent on ZK URL alphabetical order) in HAGroupStoreClientIT.testLegacyCrrSyncPreservesPreSeededRole2WhenPeerMissing and testLegacyCrrSyncStateOnlyChangeDoesNotRewriteLegacy under the full IT suite; clean 40/40 across two consecutive full-verify runs with the fix. Co-authored-by: Cursor <cursoragent@cursor.com>

Pure formatting changes from `mvn spotless:apply` across the touched files; no behavior change. Generated-by: Cursor (Claude). Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

This PR adds an opt-in mechanism for region servers to keep the legacy ZooKeeper-based HA ClusterRoleRecord (/phoenix/ha/<group>) synchronized from the newer consistentHA state (/phoenix/consistentHA/<group>), preserving compatibility for older ZK-registry Phoenix clients.

Changes:

Introduces feature-flagged legacy CRR sync in HAGroupStoreClient, including event-driven updates plus optional periodic reconciliation on a dedicated single-thread executor.
Adds PhoenixHAAdmin helpers for atomic read of (data, stat) and typed legacy CRR writes with explicit CAS modes and consistent stale-version exception handling.
Updates ClusterRoleRecord JSON handling to round-trip registryType, adds a logical-equality helper, and expands unit/integration test coverage with new JSON fixtures.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
phoenix-core/src/test/resources/json/test_role_record_no_registry_type.json	Adds JSON fixture for deserialization when `registryType` is absent.
phoenix-core/src/test/resources/json/test_role_record_explicit_zk.json	Adds JSON fixture for explicit `registryType=ZK` round-trip tests.
phoenix-core/src/test/resources/json/test_role_record_explicit_rpc.json	Adds JSON fixture for explicit `registryType=RPC` round-trip tests.
phoenix-core/src/test/java/org/apache/phoenix/jdbc/ClusterRoleRecordTest.java	Adds tests for `registryType` JSON behavior and logical equality helper.
phoenix-core/src/it/java/org/apache/phoenix/jdbc/HAGroupStoreClientIT.java	Adds IT coverage for legacy `/phoenix/ha` CRR sync behavior and edge cases.
phoenix-core-client/src/main/java/org/apache/phoenix/schema/MetaDataClient.java	Comment reflow only.
phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServicesOptions.java	Adds defaults for new legacy CRR sync configuration keys.
phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServices.java	Defines new config key constants for legacy CRR sync.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/PhoenixHAAdmin.java	Adds atomic read helper and CAS write API for legacy CRR; fixes HAGroupStoreRecord atomic read.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreClient.java	Implements opt-in legacy CRR sync, NodeCache, executor lifecycle handling, and instance deregistration.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/ClusterRoleRecord.java	Makes `registryType` round-trip via JSON, adds explicit ctor and logical equality helper.
phoenix-core-client/src/main/java/org/apache/phoenix/exception/StaleClusterRoleRecordVersionException.java	New exception type for stale legacy CRR CAS operations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ritegarg · 2026-05-22T20:30:51Z

+            // Offload the legacy CRR sync (it does ZK + JDBC I/O) so we don't block
+            // Curator's per-namespace event dispatcher.
+            ScheduledExecutorService syncExec = legacyCrrSyncExecutor;
+            if (syncExec != null) {
+              syncExec.execute(this::syncLegacyCRRIfRoleChanged);


Agreed — close() shuts the executor down without coordinating with Curator's event thread, so an in-flight listener can race. Will wrap the syncExec.execute(...) in a try/catch (RejectedExecutionException) that no-ops at DEBUG level during shutdown.

ritegarg · 2026-05-22T20:25:00Z

  /**
   * Shuts down the periodic sync executor gracefully.
   */
+  /**
+   * Remove this instance from the static {@link #instances} map. Idempotent. Uses value-based


Good catch — leftover from the refactor that replaced shutdownSyncExecutor's old position with deregisterFromInstances. Removed the orphan block.

ritegarg · 2026-05-22T20:30:31Z

+  /**
+   * Equality on the six identity/role fields ({@code haGroupName, policy, url1, url2, role1,
+   * role2}); ignores {@code version} (always bumps) and {@code registryType} (avoids RPC->ZK
+   * thrash). Returns {@code false} if {@code other} is {@code null}.
+   */
+  public boolean isLogicallyEqualIgnoringVersionAndRegistry(ClusterRoleRecord other) {
+    if (other == null) {
+      return false;
+    }
+    return Objects.equals(haGroupName, other.haGroupName) && Objects.equals(policy, other.policy)
+      && Objects.equals(url1, other.url1) && Objects.equals(url2, other.url2)
+      && role1 == other.role1 && role2 == other.role2;
+  }


You're right that the URL-normalization point makes the "avoids RPC->ZK thrash" wording wrong: ZK-form and RPC-form URLs are different strings even for the same underlying endpoint, so the helper would return false across registry types. In practice the only caller is the sync path, which gates on shouldWriteLegacyCrr first and only ever compares two ZK-form records. Updated the javadoc to describe the actual contract (ignore version and the registryType field; same-registry callers only) and drop the misleading thrash claim. No logic change.

tkhurana · 2026-05-19T21:00:33Z

@@ -983,32 +1017,65 @@ private void closePeerConnection() {
  /**
   * Shuts down the periodic sync executor gracefully.
   */


Stale javadoc ?

Good catch — leftover from the refactor that replaced shutdownSyncExecutor's old position with deregisterFromInstances. Removed the orphan block.

apurtell · 2026-05-19T22:55:16Z

+    if (bucket == null) {
+      return;
+    }
+    bucket.remove(this.haGroupName, this);


Removing the inner map introduces a race condition that could cause a NPE.

close() is not synchronized so it can run concurrently with getInstanceForZkUrl's synchronized block, which includes this code:

instances.putIfAbsent(localZkUrl, new ConcurrentHashMap<>()); instances.get(localZkUrl).put(haGroupName, result);

Consider:

Thread A (getInstanceForZkUrl, in synchronized): instances.putIfAbsent(localZkUrl, …) — bucket already present, no-op.

Thread B (close on a sibling instance for the same zkUrl): bucket.remove(siblingHaGroupName, sibling) empties the bucket; computeIfPresent removes the bucket from the outer map.

Thread A: instances.get(localZkUrl) → null.

Thread A: Attempt to call method put(haGroupName, result) with a null reference throws NPE inside the static synchronized block.

The new client is created but never registered, the synchronized lookup is poisoned for the caller, and on the next call we leak ZK watchers + executors by recreating without knowing about the orphan.

Race is real — the deregister's bucket-removal can interleave between getInstanceForZkUrl's putIfAbsent and get. Fixing by making registration atomic at the ConcurrentHashMap level instead: replace the putIfAbsent + get().put(...) pair with a single instances.compute(localZkUrl, (k, v) -> ...) callback that creates the bucket if absent and inserts the new client in one CHM-atomic step. deregisterFromInstances's existing computeIfPresent is already CHM-atomic on the same key, so the two operations serialize via the outer CHM's per-bin lock; no new monitor needed. I considered the alternative of synchronizing deregisterFromInstances on HAGroupStoreClient.class to match the registration monitor, but it would make close() block on any in-flight constructor — including the slow paths (cache latch.await(...), JDBC bootstrap, peer cache init, NodeCache.start(true)) — which would worsen the head-of-line issue you raised on the NodeCache.start thread. Added a short comment at both call sites describing the CHM pairing so the invariant is locally visible.

apurtell · 2026-05-19T22:58:09Z

+  private boolean shouldWriteLegacyCrr(ClusterRoleRecord existing) {
+    // Refuse to overwrite a non-ZK (admin-managed RPC) legacy CRR; live readers use its
+    // registryType to build connection strings, so swapping form would break them.
+    if (existing != null && existing.getRegistryType() != RegistryType.ZK) {


Do pre-existing legacy CRRs without registryType get permanently locked out?

ClusterRoleRecord.fromJson deserializes pre-PR JSON without a registryType field as RegistryType.RPC. shouldWriteLegacyCrr then bails out here without updating anything.

The feature branch hasn't shipped, so every pre-existing /phoenix/ha/ record was written by apache:master or earlier. Master's toJson always emits registryType, so the three realistic record shapes all behave correctly:

Master ZK record ("registryType":"ZK" explicit) → parses cleanly → shouldWriteLegacyCrr true → sync proceeds.

Admin-managed RPC record ("registryType":"RPC" explicit) → parses as RPC → gate fires → sync skips (intended).

Pre-PHOENIX-7495 record (zk1/zk2 keys, no registryType) → strict Jackson rejects unknown fields → existing=null, stat populated → CAS_WITH_VERSION branch overwrites with a fresh ZK record.

The "no registryType AND RPC-parsable URL" combination can't be produced by any shipped writer, so no live record is locked out. Added testLegacyCrrSyncMigratesOlderZk1Zk2Record in HAGroupStoreClientIT (14/14 legacy-sync ITs green) covering the pre-PHOENIX-7495 path explicitly.

apurtell · 2026-05-19T23:02:08Z

+  private void setupLegacyCrrSync() throws Exception {
+    this.legacyHaAdmin = new PhoenixHAAdmin(this.zkUrl, conf, PHOENIX_HA_ZOOKEEPER_ZNODE_NAMESPACE);
+    this.legacyCrrNodeCache = new NodeCache(this.legacyHaAdmin.getCurator(), toPath(haGroupName));
+    this.legacyCrrNodeCache.start(true);


NodeCache.start() does a synchronous getData() call to ZK on this thread, which is called by the the constructor. All of this is synchronized on HAGroupStoreClient.class in getInstanceForZkUrl.

New instance creation across HA groups is serialized. If ZK is unhealthy or slow there will be a head-of-line blocking here.

Is this expected?

Switched to start(false) and posted an async rebuild() to legacyCrrSyncExecutor — keeps the cache warm for steady-state reads without holding the class monitor on ZK. First-sync cache miss is covered by the existing fallback at L1163-L1166; rebuild failure logs at DEBUG and is retried on the next sync cycle.

apurtell · 2026-05-19T23:07:23Z

+          existing != null ? existing.getVersion() : -1L, desired.getVersion());
+      } catch (StaleClusterRoleRecordVersionException stale) {
+        // CAS lost; next event/periodic cycle reconverges.
+        LOGGER.info("Legacy CRR CAS lost for HA group {} at expected stat version {}", haGroupName,


Drop this to DEBUG level.

Each RS independently runs the sync, so during a state transition every RS computes the same desired CRR and races on CAS. Only one can win and the other N−1 RSes will log this line at INFO level.

On a 500 node cluster, that is 499 unnecessary, imho, INFO level log lines every time this happens.

Agreed, fixing. Demoted the "CAS lost" line to DEBUG. The "Synced legacy CRR ... -> ..." line at the line above stays INFO since only the winner logs it (one INFO per state transition cluster-wide).

Co-authored-by: Cursor <cursoragent@cursor.com>

Ritesh Garg and others added 3 commits May 16, 2026 16:34

PHOENIX-7787 Make CCF HAGroupStore ZK Updates backward compatible wit…

49c1cec

…h existing ZK based client

PHOENIX-7787 Apply spotless formatting

3e95d01

Pure formatting changes from `mvn spotless:apply` across the touched files; no behavior change. Generated-by: Cursor (Claude). Co-authored-by: Cursor <cursoragent@cursor.com>

apurtell requested a review from Copilot May 19, 2026 22:29

Copilot started reviewing on behalf of apurtell May 19, 2026 22:29 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

tkhurana reviewed May 19, 2026

View reviewed changes

apurtell reviewed May 19, 2026

View reviewed changes

PHOENIX-7787 Address PR review comments

7344771

Co-authored-by: Cursor <cursoragent@cursor.com>

ritegarg requested review from apurtell and tkhurana May 22, 2026 20:31

Conversation

ritegarg commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ritegarg May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ritegarg commented May 16, 2026 •

edited

Loading

ritegarg May 22, 2026 •

edited

Loading