Skip to content

PHOENIX-7787 Make CCF HAGroupStore ZK Updates backward compatible with existing ZK based client#2479

Open
ritegarg wants to merge 4 commits into
apache:PHOENIX-7562-feature-newfrom
ritegarg:PHOENIX-7787-legacy-crr-sync
Open

PHOENIX-7787 Make CCF HAGroupStore ZK Updates backward compatible with existing ZK based client#2479
ritegarg wants to merge 4 commits into
apache:PHOENIX-7562-feature-newfrom
ritegarg:PHOENIX-7787-legacy-crr-sync

Conversation

@ritegarg
Copy link
Copy Markdown
Contributor

@ritegarg ritegarg commented May 16, 2026

What changes were proposed in this pull request?

Adds an opt-in sync from /phoenix/consistentHA/<G> to the legacy /phoenix/ha/<G> znode in
each region server's HAGroupStoreClient. Each client derives a combined ClusterRoleRecord
(local role + peer role) and writes it to its local /phoenix/ha via ZK stat-version CAS, so
pre-consistentHA ZK-registry clients keep reading a current view of the HA group.

New config keys (in QueryServices / QueryServicesOptions):

  • phoenix.ha.legacy.crr.sync.enabled — master switch (default false).
  • phoenix.ha.legacy.crr.reconciliation.interval.seconds — periodic reconciliation cadence
    (default 60; 0 disables only the periodic loop, the event-driven path still runs).

Convergence model:

  • Event-driven on consistentHA CHILD_ADDED / CHILD_UPDATED (LOCAL or PEER), offloaded onto
    a per-group single-thread ScheduledExecutorService so Curator's per-namespace event
    dispatcher never blocks on ZK or JDBC I/O.
  • Periodic reconciler on the same executor (0–30s jitter on first run).
  • Curator NodeCache watcher on the legacy znode itself, so per-sync reads are in-memory
    rather than a ZK round-trip. The NodeCache is eventually consistent, so on apparent absence
    the sync path falls back to one authoritative getData() to disambiguate "absent" from
    "not-yet-cached" before deciding CREATE_NEW vs CAS_WITH_VERSION.
  • On CAS BadVersion: log at INFO and bail. The next event/periodic cycle reconverges. No
    inline retry.

Code-level changes:

  • ClusterRoleRecord: new explicit-RegistryType ctor; @JsonCreator now round-trips
    registryType (legacy znode is always written with RegistryType.ZK for backward
    compatibility, and must read back as ZK). New isLogicallyEqualIgnoringVersionAndRegistry
    helper used as the rewrite short-circuit.
  • PhoenixHAAdmin:
    • Three new helpers on the legacy namespace:
      • getClusterRoleRecordAndStatInZooKeeper — atomic (data, stat) read via
        storingStatIn.
      • createOrUpdateClusterRoleRecordWithCAS(name, record, LegacyCrrWriteMode mode, int expectedStatVersion) — typed write API. LegacyCrrWriteMode is a Phoenix-internal
        enum with values CREATE_NEW, FORCE_OVERWRITE (@VisibleForTesting),
        CAS_WITH_VERSION; the int argument is meaningful only for CAS_WITH_VERSION
        (validated >= 0 at entry). Both BadVersion (CAS lost) and NodeExists (create
        lost) surface as StaleClusterRoleRecordVersionException so callers retry
        uniformly.
    • Atomic-read fix on the existing getHAGroupStoreRecordInZooKeeper (replaces a
      pre-existing 2-RPC pattern that could return a stat version not matching its data
      bytes).
  • HAGroupStoreClient: legacyHaAdmin + legacyCrrNodeCache + dedicated periodic executor.
    Listener hooks for CHILD_ADDED / CHILD_UPDATED. CHILD_REMOVED is intentionally a
    no-op — the legacy /phoenix/ha znode is never deleted by this client.
  • New StaleClusterRoleRecordVersionException (CRR-specific analog of
    StaleHAGroupStoreRecordVersionException).

Concurrency / lifecycle invariants:

  • All sync invocations (initial, event-driven, periodic) route through the single-thread
    executor, so they are serialized naturally; no per-instance lock is held across ZK I/O.
  • The initial sync is submitted to the executor rather than invoked inline in the constructor,
    so a fresh HAGroupStoreClient does not block getInstanceForZkUrl's static class monitor
    on ZK / JDBC I/O.
  • syncLegacyCRRIfRoleChanged snapshots legacyHaAdmin / legacyCrrNodeCache to local
    finals before touching them, so a racing close() cannot NPE an in-flight call or push
    writes through a torn-down Curator client.
  • close() marks the instance isHealthy=false and removes it from the static instances
    map up front, so a concurrent getInstanceForZkUrl cannot hand out a half-closed client.

Back-compat and divergence guards:

  • Refuse to overwrite a non-ZK admin-managed legacy CRR (typically RegistryType.RPC with
    master-form URLs). Existing readers feed the stored URLs through
    JDBCUtil.formatUrl(url, roleRecord.getRegistryType()) to build connection strings;
    swapping registryType + URLs out from under them would silently change their target. The
    sync skips with a WARN and requires an explicit admin migration to ZK form before
    proceeding.
  • Wait for a non-blank peer ZK URL on the local record before building the desired CRR;
    ClusterRoleRecord's ctor would otherwise NPE on null url2 via JDBCUtil.formatUrl.
  • When the local peer cache returns null, preserve existing.role2 in the desired CRR
    rather than downgrading to UNKNOWN. UNKNOWN is strictly less information than a
    previously-recorded role, and downgrading would cause a multi-RS write storm where peers
    with differing visibility CAS-overwrite each other.

Why are the changes needed?

The Consistent Cluster Failover work moved HA group state from /phoenix/ha (which held a
ClusterRoleRecord) to /phoenix/consistentHA (which holds an HAGroupStoreRecord and uses
an RPC-registry view of CRRs). Existing Phoenix clients built against the older ZK-registry
contract still read /phoenix/ha/<G> and expect:

  • The znode to exist for any live HA group.
  • The record's registryType to be ZK.
  • The roles to reflect the latest state of both clusters.

Without this patch, after a CCF rollout the /phoenix/ha znode goes stale or absent, and old
ZK-registry clients break. This change keeps /phoenix/ha in sync with the new authoritative
/phoenix/consistentHA records on a per-RS basis, gated by an opt-in flag so it only runs
where the operator needs it.

Does this PR introduce any user-facing change?

Yes, but the new behavior is gated entirely behind the new master switch
(phoenix.ha.legacy.crr.sync.enabled), which defaults to false.

When the flag is enabled:

  • Each HAGroupStoreClient writes to (and maintains) the legacy /phoenix/ha/<G> znode on
    its local ZK.
  • Old pre-consistentHA ZK-registry clients pointed at /phoenix/ha will see fresh records
    again.
  • The flag is read once at construction; toggling it off requires a process restart. This is
    documented in HAGroupStoreClient's class Javadoc.
  • Rollout constraint: the persisted legacy znode now carries the new registryType
    field. Pre-PR clients using the default Jackson ObjectMapper
    (FAIL_ON_UNKNOWN_PROPERTIES=true) will reject the new bytes. All readers should be
    upgraded to a tolerant build before enabling this flag on any writer.

One incidental visible change independent of the flag: ClusterRoleRecord JSON
deserialization now respects the persisted registryType field. Previously it was hard-coded
to RPC regardless of JSON content. Existing JSON without a registryType field still
defaults to RPC, so deployments not exposed to legacy-sync writes see no behavior change.

How was this patch tested?

  • Integration tests in HAGroupStoreClientIT (40 cases, all passing) covering: feature-off
    no-op, initial create, role-changing transition, state-only transition (no rewrite),
    peer-driven role flip, peer-absent → role2=UNKNOWN → recovery, CAS error mapping +
    write-mode dispatch (CREATE_NEW, FORCE_OVERWRITE, CAS_WITH_VERSION negative-version
    rejection), LOCAL/PEER CHILD_REMOVED (znode preserved), registryType stays ZK across
    multiple role cycles, reconciliation interval =0 (event-driven still works), periodic-loop
    recovery after an external divergence, and the two halves of the peer-view contract:
    peer absent → existing role2 preserved (no write fires), peer present → stale role2
    reconciles to live state.
  • Unit tests in ClusterRoleRecordTest (21 cases, all passing) covering JSON backward-compat
    (missing registryType defaults to RPC, explicit RPC round-trips, explicit ZK round-trips,
    full ZK toJson/fromJson round-trip preserves registryType) and the
    isLogicallyEqualIgnoringVersionAndRegistry helper (null, same-modulo-version, ZK-vs-RPC
    URL form, role mismatch, name mismatch). Backed by three JSON fixtures under
    phoenix-core/src/test/resources/json/.
  • mvn spotless:check clean across the touched modules.
  • Ran all ITs locally, no new failures due to this PR

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic) and Cursor (Claude).

Ritesh Garg and others added 3 commits May 16, 2026 16:34
ClusterRoleRecord's ctor canonicalizes url1/url2 by lexical order, so
existing.getRole2() returns the role for whichever URL sorts larger -
not necessarily the peer URL. When the local peer cache is empty and we
were trying to preserve existing peer role, we'd half the time inherit
the local role instead, then the equality check would see a "logical
change" and trigger a redundant CAS write.

Manifests as flakes (deployment-dependent on ZK URL alphabetical order)
in HAGroupStoreClientIT.testLegacyCrrSyncPreservesPreSeededRole2WhenPeerMissing
and testLegacyCrrSyncStateOnlyChangeDoesNotRewriteLegacy under the full
IT suite; clean 40/40 across two consecutive full-verify runs with the
fix.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pure formatting changes from `mvn spotless:apply` across the touched files;
no behavior change.

Generated-by: Cursor (Claude).
Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an opt-in mechanism for region servers to keep the legacy ZooKeeper-based HA ClusterRoleRecord (/phoenix/ha/<group>) synchronized from the newer consistentHA state (/phoenix/consistentHA/<group>), preserving compatibility for older ZK-registry Phoenix clients.

Changes:

  • Introduces feature-flagged legacy CRR sync in HAGroupStoreClient, including event-driven updates plus optional periodic reconciliation on a dedicated single-thread executor.
  • Adds PhoenixHAAdmin helpers for atomic read of (data, stat) and typed legacy CRR writes with explicit CAS modes and consistent stale-version exception handling.
  • Updates ClusterRoleRecord JSON handling to round-trip registryType, adds a logical-equality helper, and expands unit/integration test coverage with new JSON fixtures.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
phoenix-core/src/test/resources/json/test_role_record_no_registry_type.json Adds JSON fixture for deserialization when registryType is absent.
phoenix-core/src/test/resources/json/test_role_record_explicit_zk.json Adds JSON fixture for explicit registryType=ZK round-trip tests.
phoenix-core/src/test/resources/json/test_role_record_explicit_rpc.json Adds JSON fixture for explicit registryType=RPC round-trip tests.
phoenix-core/src/test/java/org/apache/phoenix/jdbc/ClusterRoleRecordTest.java Adds tests for registryType JSON behavior and logical equality helper.
phoenix-core/src/it/java/org/apache/phoenix/jdbc/HAGroupStoreClientIT.java Adds IT coverage for legacy /phoenix/ha CRR sync behavior and edge cases.
phoenix-core-client/src/main/java/org/apache/phoenix/schema/MetaDataClient.java Comment reflow only.
phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServicesOptions.java Adds defaults for new legacy CRR sync configuration keys.
phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServices.java Defines new config key constants for legacy CRR sync.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/PhoenixHAAdmin.java Adds atomic read helper and CAS write API for legacy CRR; fixes HAGroupStoreRecord atomic read.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreClient.java Implements opt-in legacy CRR sync, NodeCache, executor lifecycle handling, and instance deregistration.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/ClusterRoleRecord.java Makes registryType round-trip via JSON, adds explicit ctor and logical equality helper.
phoenix-core-client/src/main/java/org/apache/phoenix/exception/StaleClusterRoleRecordVersionException.java New exception type for stale legacy CRR CAS operations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +916 to +920
// Offload the legacy CRR sync (it does ZK + JDBC I/O) so we don't block
// Curator's per-namespace event dispatcher.
ScheduledExecutorService syncExec = legacyCrrSyncExecutor;
if (syncExec != null) {
syncExec.execute(this::syncLegacyCRRIfRoleChanged);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — close() shuts the executor down without coordinating with Curator's event thread, so an in-flight listener can race. Will wrap the syncExec.execute(...) in a try/catch (RejectedExecutionException) that no-ops at DEBUG level during shutdown.

Comment on lines +1017 to +1021
/**
* Shuts down the periodic sync executor gracefully.
*/
/**
* Remove this instance from the static {@link #instances} map. Idempotent. Uses value-based
Copy link
Copy Markdown
Contributor Author

@ritegarg ritegarg May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — leftover from the refactor that replaced shutdownSyncExecutor's old position with deregisterFromInstances. Removed the orphan block.

Comment on lines +244 to +256
/**
* Equality on the six identity/role fields ({@code haGroupName, policy, url1, url2, role1,
* role2}); ignores {@code version} (always bumps) and {@code registryType} (avoids RPC->ZK
* thrash). Returns {@code false} if {@code other} is {@code null}.
*/
public boolean isLogicallyEqualIgnoringVersionAndRegistry(ClusterRoleRecord other) {
if (other == null) {
return false;
}
return Objects.equals(haGroupName, other.haGroupName) && Objects.equals(policy, other.policy)
&& Objects.equals(url1, other.url1) && Objects.equals(url2, other.url2)
&& role1 == other.role1 && role2 == other.role2;
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that the URL-normalization point makes the "avoids RPC->ZK thrash" wording wrong: ZK-form and RPC-form URLs are different strings even for the same underlying endpoint, so the helper would return false across registry types. In practice the only caller is the sync path, which gates on shouldWriteLegacyCrr first and only ever compares two ZK-form records. Updated the javadoc to describe the actual contract (ignore version and the registryType field; same-registry callers only) and drop the misleading thrash claim. No logic change.

@@ -983,32 +1017,65 @@ private void closePeerConnection() {
/**
* Shuts down the periodic sync executor gracefully.
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale javadoc ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — leftover from the refactor that replaced shutdownSyncExecutor's old position with deregisterFromInstances. Removed the orphan block.

if (bucket == null) {
return;
}
bucket.remove(this.haGroupName, this);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the inner map introduces a race condition that could cause a NPE.

close() is not synchronized so it can run concurrently with getInstanceForZkUrl's synchronized block, which includes this code:

            instances.putIfAbsent(localZkUrl, new ConcurrentHashMap<>());
            instances.get(localZkUrl).put(haGroupName, result);

Consider:

  1. Thread A (getInstanceForZkUrl, in synchronized): instances.putIfAbsent(localZkUrl, …) — bucket already present, no-op.
  2. Thread B (close on a sibling instance for the same zkUrl): bucket.remove(siblingHaGroupName, sibling) empties the bucket; computeIfPresent removes the bucket from the outer map.
  3. Thread A: instances.get(localZkUrl) → null.
  4. Thread A: Attempt to call method put(haGroupName, result) with a null reference throws NPE inside the static synchronized block.

The new client is created but never registered, the synchronized lookup is poisoned for the caller, and on the next call we leak ZK watchers + executors by recreating without knowing about the orphan.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race is real — the deregister's bucket-removal can interleave between getInstanceForZkUrl's putIfAbsent and get. Fixing by making registration atomic at the ConcurrentHashMap level instead: replace the putIfAbsent + get().put(...) pair with a single instances.compute(localZkUrl, (k, v) -> ...) callback that creates the bucket if absent and inserts the new client in one CHM-atomic step. deregisterFromInstances's existing computeIfPresent is already CHM-atomic on the same key, so the two operations serialize via the outer CHM's per-bin lock; no new monitor needed. I considered the alternative of synchronizing deregisterFromInstances on HAGroupStoreClient.class to match the registration monitor, but it would make close() block on any in-flight constructor — including the slow paths (cache latch.await(...), JDBC bootstrap, peer cache init, NodeCache.start(true)) — which would worsen the head-of-line issue you raised on the NodeCache.start thread. Added a short comment at both call sites describing the CHM pairing so the invariant is locally visible.

private boolean shouldWriteLegacyCrr(ClusterRoleRecord existing) {
// Refuse to overwrite a non-ZK (admin-managed RPC) legacy CRR; live readers use its
// registryType to build connection strings, so swapping form would break them.
if (existing != null && existing.getRegistryType() != RegistryType.ZK) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do pre-existing legacy CRRs without registryType get permanently locked out?

ClusterRoleRecord.fromJson deserializes pre-PR JSON without a registryType field as RegistryType.RPC. shouldWriteLegacyCrr then bails out here without updating anything.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feature branch hasn't shipped, so every pre-existing /phoenix/ha/ record was written by apache:master or earlier. Master's toJson always emits registryType, so the three realistic record shapes all behave correctly:

  • Master ZK record ("registryType":"ZK" explicit) → parses cleanly → shouldWriteLegacyCrr true → sync proceeds.
  • Admin-managed RPC record ("registryType":"RPC" explicit) → parses as RPC → gate fires → sync skips (intended).
  • Pre-PHOENIX-7495 record (zk1/zk2 keys, no registryType) → strict Jackson rejects unknown fields → existing=null, stat populated → CAS_WITH_VERSION branch overwrites with a fresh ZK record.

The "no registryType AND RPC-parsable URL" combination can't be produced by any shipped writer, so no live record is locked out. Added testLegacyCrrSyncMigratesOlderZk1Zk2Record in HAGroupStoreClientIT (14/14 legacy-sync ITs green) covering the pre-PHOENIX-7495 path explicitly.

private void setupLegacyCrrSync() throws Exception {
this.legacyHaAdmin = new PhoenixHAAdmin(this.zkUrl, conf, PHOENIX_HA_ZOOKEEPER_ZNODE_NAMESPACE);
this.legacyCrrNodeCache = new NodeCache(this.legacyHaAdmin.getCurator(), toPath(haGroupName));
this.legacyCrrNodeCache.start(true);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeCache.start() does a synchronous getData() call to ZK on this thread, which is called by the the constructor. All of this is synchronized on HAGroupStoreClient.class in getInstanceForZkUrl.

New instance creation across HA groups is serialized. If ZK is unhealthy or slow there will be a head-of-line blocking here.

Is this expected?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to start(false) and posted an async rebuild() to legacyCrrSyncExecutor — keeps the cache warm for steady-state reads without holding the class monitor on ZK. First-sync cache miss is covered by the existing fallback at L1163-L1166; rebuild failure logs at DEBUG and is retried on the next sync cycle.

existing != null ? existing.getVersion() : -1L, desired.getVersion());
} catch (StaleClusterRoleRecordVersionException stale) {
// CAS lost; next event/periodic cycle reconverges.
LOGGER.info("Legacy CRR CAS lost for HA group {} at expected stat version {}", haGroupName,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop this to DEBUG level.

Each RS independently runs the sync, so during a state transition every RS computes the same desired CRR and races on CAS. Only one can win and the other N−1 RSes will log this line at INFO level.

On a 500 node cluster, that is 499 unnecessary, imho, INFO level log lines every time this happens.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, fixing. Demoted the "CAS lost" line to DEBUG. The "Synced legacy CRR ... -> ..." line at the line above stays INFO since only the winner logs it (one INFO per state transition cluster-wide).

Co-authored-by: Cursor <cursoragent@cursor.com>
@ritegarg ritegarg requested review from apurtell and tkhurana May 22, 2026 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants