
fix(search): restore live settings on per-entity promote path #27920

Merged
harshach merged 29 commits into main from harshach/search-alias-promote
May 6, 2026

Conversation

@harshach
Collaborator

@harshach harshach commented May 5, 2026

Summary

Reindex promotion succeeds but the swapped-in index keeps the bulk-build settings (refresh_interval=-1, replicas=0, translog=async), so live writes after promotion are buffered indefinitely and never become searchable until a manual _refresh. This surfaces in production as "indexing is fast, but searches/UIs return nothing on freshly created entities."

Two coupled bugs in the per-entity distributed promote path:

  • DefaultRecreateHandler.promoteEntityIndex was missing the live-settings restore. finalizeReindex calls applyLiveServingSettings + maybeForceMerge before the alias swap; the per-entity sibling jumped straight from doc-count check to alias swap. Mirror the call sequence.
  • DistributedSearchIndexExecutor constructed a second RecreateIndexHandler without wiring jobData. With jobData=null, applyLiveServingSettings resolves no overrides, buildRevertJson returns null, and the call silently no-ops — so even if the previous fix were in place, the per-entity path would still miss settings restoration. Pass currentJob.getJobConfiguration() into the handler at construction, mirroring what DistributedIndexingStrategy already does.
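The jobData no-op in the second bug can be reduced to a miniature sketch (all class and method names below are simplified stand-ins, not the real OpenMetadata signatures):

```java
import java.util.Map;

// Simplified stand-in for DefaultRecreateHandler's jobData dependency.
class RecreateHandlerSketch {
  private Map<String, Object> jobData; // stays null unless withJobData(...) is called

  RecreateHandlerSketch withJobData(Map<String, Object> jobData) {
    this.jobData = jobData;
    return this;
  }

  // Mirrors the failure mode: with no jobData there are no overrides to
  // resolve, so the revert JSON is null and the settings restore no-ops.
  String buildRevertJson() {
    if (jobData == null) {
      return null;
    }
    return "{\"refresh_interval\":\"1s\",\"number_of_replicas\":1,"
        + "\"translog.durability\":\"request\"}";
  }
}
```

A handler built without withJobData silently skips the restore; the fix passes currentJob.getJobConfiguration() at construction so jobData is non-null on the per-entity path.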

Provenance

These two source changes are cherry-picks of b272de85f9 and 68dad610ed, which currently sit unmerged on the context_center feature branch (PR #27558) — a Knowledge → Context Center rename PR that is unrelated to search indexing. Splitting them out so they can ship independently.

A regression test (testPromoteEntityIndexAppliesLiveServingSettingsBeforeSwap) is added to lock in the behavior: it builds a handler with bulk overrides on EventPublisherJob, calls promoteEntityIndex, and asserts searchClient.updateIndexSettings is invoked with refresh_interval=1s, number_of_replicas=1, translog.durability=request before the alias swap. Verified locally: the test passes with both source commits applied; reverting either one fails the assertion with "Wanted but not invoked".

Test plan

  • mvn test -pl openmetadata-service -Dtest='DefaultRecreateHandlerTest,DistributedSearchIndexExecutorTest,DistributedIndexingStrategyTest,SearchIndexExecutorControlFlowTest' → 136 tests, 0 failures
  • Sanity-check the regression test fails when applyLiveServingSettings/maybeForceMerge are removed from promoteEntityIndex (Mockito reports "Wanted but not invoked")
  • Manual: trigger a Search Indexing app run with bulk overrides configured (replicas=0, refresh=-1, translog=async) and confirm post-promotion _settings show refresh_interval=1s, not -1

🤖 Generated with Claude Code


Summary by Gitar

  • Search index cleanup:
    • Modified DefaultRecreateHandler to prevent unnecessary delete-by-alias-name operations on canonical indices.
    • Added guard logic to skip deletion when the index is an active alias, avoiding 30+ seconds of exponential backoff per entity.
  • Testing:
    • Added unit test testFinalizeReindexSkipsDeleteWhenCanonicalIsAlias to verify the cleanup guard and ensure correct alias migration to the staged index.


harshach and others added 3 commits May 5, 2026 13:59
DefaultRecreateHandler exposes two finalization paths:

  - finalizeReindex(...)        — centralized end-of-job promotion. Calls
                                  applyLiveServingSettings + maybeForceMerge
                                  before the alias swap, reverting the bulk
                                  overrides (refresh_interval=-1, replicas=0,
                                  async translog) back to live values
                                  (refresh=1s, replicas=1, durable translog).

  - promoteEntityIndex(ctx, ok) — per-entity promotion. Used by the distributed
                                  search-indexer's "promote as soon as all
                                  partitions for an entity complete" callback
                                  (DistributedSearchIndexExecutor.promoteEntityIndex).
                                  Swaps the alias and cleans up old indices —
                                  but never restored live settings.

When an entity finishes its partitions before the final reconciliation
(typically the smallest entities — e.g. knowledge `page` with ~11 rows),
its index is promoted via the per-entity path, the alias swap succeeds,
and the bulk-build overrides become the new live settings. refresh_interval
stays at -1 in production, so live writes after the reindex are buffered in
the translog and never reach searchable segments until a manual _refresh.
Externally this surfaces as "create an article, hierarchy is empty until I
re-trigger reindex" — exactly the user-reported bug.

Mirror the finalizeReindex sequence by calling applyLiveServingSettings
(and maybeForceMerge for parity) at the top of the promote block in
promoteEntityIndex, before the alias swap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
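The sequence this commit mirrors can be shown as a call-order sketch, a toy model that records calls into a list; the method names are simplified from the description above, not the real signatures:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the promote ordering promoteEntityIndex now shares with
// finalizeReindex: restore live settings (and optionally force-merge)
// BEFORE the alias swap.
class PromoteOrderSketch {
  final List<String> calls = new ArrayList<>();

  void applyLiveServingSettings(String index) { calls.add("settings:" + index); }
  void maybeForceMerge(String index) { calls.add("forceMerge:" + index); }
  void swapAliases(String alias, String index) { calls.add("swap:" + alias + "->" + index); }

  void promoteEntityIndex(String staged, String canonical, boolean reindexOk) {
    if (!reindexOk) return;
    applyLiveServingSettings(staged); // revert refresh=-1 / replicas=0 / async translog
    maybeForceMerge(staged);          // parity with finalizeReindex
    swapAliases(canonical, staged);   // only now does the index start serving reads
  }
}
```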
DefaultRecreateHandler.applyLiveServingSettings reads from the handler's
jobData field (live + bulk index-settings overrides on the EventPublisherJob).
The per-entity distributed-promotion path in DistributedSearchIndexExecutor
created its own DefaultRecreateHandler instance and never called
withJobData(jobData) on it. With jobData=null, buildRevertJson returns null
and applyLiveServingSettings silently no-ops — meaning the previous fix
(b272de8) never actually re-applied live settings on the per-entity
promote path, even though the call was reached.

currentJob.getJobConfiguration() is the EventPublisherJob the strategy
created. Wire it into the new handler at construction time, mirroring the
withJobData call DistributedIndexingStrategy already makes on the strategy's
own handler instance.

With this change, the per-entity promote path now logs

  "Applying live serving settings to staged index '...' for entity 'page':
   {\"number_of_replicas\":1,\"refresh_interval\":\"1s\", ...}"

before the alias swap, and post-promotion `_settings` show
refresh_interval=1s instead of the stuck -1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Asserts that DefaultRecreateHandler.promoteEntityIndex calls
searchClient.updateIndexSettings with the live-revert JSON
(refresh_interval=1s, replicas=1, translog.durability=request) before
swapping the alias, given a handler with bulk overrides wired through
withJobData. Without the two preceding fixes the assertion fails with
"Wanted but not invoked" — applyLiveServingSettings was never reached
on the per-entity promotion path, so the staged index inherited
refresh_interval=-1 and post-promotion live writes never became
searchable until a manual _refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 21:20
github-actions bot added the backend and safe to test labels May 5, 2026
Contributor

Copilot AI left a comment


Pull request overview

Fixes the distributed per-entity search-index promotion path so a promoted staged index is returned to live-serving settings before it starts serving reads/writes. This sits in the search reindexing flow and aligns the per-entity distributed promote path with the existing full finalizeReindex behavior.

Changes:

  • Restores live-serving index settings and optional force-merge in promoteEntityIndex before alias promotion.
  • Passes distributed job configuration into the recreate handler so per-entity promotion can resolve bulk/live index settings.
  • Adds a regression test around DefaultRecreateHandler.promoteEntityIndex.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Reviewed files:

  • openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java: adds live-settings restore + force-merge to the per-entity promotion path.
  • openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java: wires currentJob configuration into the recreate handler used by distributed per-entity promotion.
  • openmetadata-service/src/test/java/org/openmetadata/service/search/DefaultRecreateHandlerTest.java: adds a regression test for live-settings restoration during per-entity promotion.

harshach and others added 2 commits May 5, 2026 14:37
DefaultRecreateHandlerTest.PromoteEntityIndexTests:
  - testPromoteEntityIndexAppliesSettingsBeforeAliasSwap: InOrder
    verification that updateIndexSettings runs BEFORE swapAliases. Catches a
    swap-then-revert misordering (which would briefly serve live reads against
    refresh=-1 settings).
  - testPromoteEntityIndexForceMergesWhenConfigured: forceMerge(staged, 1) is
    invoked when bulkIndexSettings.forceMergeOnPromote=true. Catches a
    regression where the force-merge call gets dropped without anyone
    noticing.
  - testPromoteEntityIndexSkipsSettingsWithoutJobData: locks in the safe no-op
    behavior when a handler is constructed without withJobData. Documents that
    no-jobData → no settings call (vs. crash or silent revert to defaults).

DistributedSearchIndexExecutorTest:
  - initializeEntityTrackerWiresJobDataIntoDefaultRecreateHandler: triggers
    the private initializeEntityTracker with currentJob holding a populated
    jobConfiguration and verifies recreateHandler.withJobData(jobData) is
    called on the per-entity handler. This catches the second half of the
    original regression: even if applyLiveServingSettings is reached on
    promoteEntityIndex, jobData=null makes it a silent no-op. Future edits
    that drop the wiring or move handler construction elsewhere will fail
    here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Triggers SearchIndexingApplication with bulkIndexSettings configured
(refresh_interval=-1, number_of_replicas=0, translog.durability=async),
waits for the run to terminate, then queries _settings on the promoted
table_search_index alias against the real OpenSearch/Elasticsearch
container (via TestSuiteBootstrap.createSearchClient()). Asserts that
each concrete index resolved by the alias has the live values applied
(refresh=1s, replicas=1, translog.durability=request) and not the bulk
overrides.

This is the end-to-end counterpart to the unit-level regression test in
DefaultRecreateHandlerTest. Catches the same class of bug at the layer
where it actually surfaced in production: an alias swap that completed
successfully according to logs but left the new live index unsearchable
because refresh was disabled and writes were buffered indefinitely.

Modeled on SearchIndexingFieldsParityIT for run-trigger / poll structure;
adds the post-completion _settings verification step that no other IT
performs today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dant test

DefaultRecreateHandler: move applyLiveServingSettings + maybeForceMerge
inside the try/finally that unregisters the staged index. Without this, a
transient OS/ES failure during _settings update or _forcemerge propagated out
before the finally ran, leaving the staged index permanently registered
(searchRepository.unregisterStagedIndex never executed), so live writes kept
routing to a staged index nothing reads from. Same fix applied to
finalizeReindex for consistency
(its window is shorter since it runs at end-of-job, but the leak shape is
identical). Per gitar-bot review.

DefaultRecreateHandlerTest:
  - testPromoteEntityIndexAppliesLiveServingSettingsBeforeSwap: replace
    independent verify() calls with InOrder so the test actually locks the
    "settings before alias swap" ordering that its name and the PR description
    promise. A swap-then-revert refactor would have passed before this change.
    Per Copilot review.
  - Drop testPromoteEntityIndexAppliesSettingsBeforeAliasSwap (the standalone
    InOrder test added in the previous commit) — folded back into the test
    above, which now covers both ordering and JSON content in one place.
  - Add testPromoteEntityIndexUnregistersStagedIndexOnSettingsFailure —
    regression test for the gitar-bot fix above. Verified to fail with
    "IllegalState connection reset" when the calls are moved back outside
    the try block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
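The try/finally shape this commit describes, reduced to a toy model; the names are illustrative, and the stub throws to model a transient OS/ES failure during the settings update:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the leak fix: the settings update sits inside the try whose
// finally unregisters the staged index, so a transient search-cluster
// failure cannot leave the registration behind.
class StagedRegistrationSketch {
  final List<String> events = new ArrayList<>();
  boolean failSettings; // when true, models a transient OS/ES failure

  void applyLiveServingSettings(String idx) {
    if (failSettings) throw new IllegalStateException("connection reset");
    events.add("settings:" + idx);
  }

  void swapAliases(String alias, String idx) { events.add("swap:" + idx); }

  void unregisterStagedIndex(String type, String idx) { events.add("unregister:" + idx); }

  void promote(String entityType, String staged) {
    try {
      applyLiveServingSettings(staged); // may throw mid-job
      swapAliases(entityType, staged);
    } finally {
      unregisterStagedIndex(entityType, staged); // always released now
    }
  }
}
```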
Copilot AI review requested due to automatic review settings May 5, 2026 21:51
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

The why-it's-in-a-try and why-jobData-is-wired blocks read like commit
messages, not code annotations. Tests and commit history carry the
rationale; the code itself reads fine without them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ic IT

Six findings from copilot-pull-request-reviewer.

1+4. Track canonicalDeleted via positive evidence only (DefaultRecreateHandler).
   ES/OS indexExists/getAliases/listIndicesByPrefix all swallow transport
   errors and return false/empty. A "negative" probe result cannot be
   distinguished from "probe failed". The previous shape blindly trusted
   probe values and could misclassify a transient failure as data loss
   when canonical was actually still serving (gate-says-no-but-actually-yes
   case) or, conversely, as not-data-loss when canonical was actually
   gone. Now: a `canonicalDeleted` boolean defaults to false and only
   flips to true after both the delete call and a positive post-state
   check confirm the index is gone. dataLoss classification uses this
   flag — never claim data loss without positive evidence canonical was
   deleted. Added regression test
   `testPromoteEntityIndexAmbiguousGateProbeIsNotDataLoss` for the gate-
   ambiguity case.

2. Wire onJobFailed in DistributedIndexingStrategy.
   Previously only SearchIndexExecutor emitted the listener callback for
   FAILED. The distributed strategy returned the status but never invoked
   listeners.onJobFailed, so jobData.failure stayed empty and the
   AppRunRecord/WebSocket update had no failure context. Now mirrors the
   single-server behavior with an IllegalStateException naming the
   data-loss entities.

3. IT: assertEquals("request", durability) instead of assertNotEquals
   ("async"). The non-equals assertion would pass if a silent translog-
   revert drop left durability at any non-async cluster default, missing
   the regression. Pin the exact configured value.

5. IT: assert exactly one concrete index resolves the alias, not
   at-least-one. A broken swap that leaves the alias attached to BOTH
   the pre-existing live index AND the new *_rebuild_* index would
   satisfy "any rebuild present" but duplicate search results in
   production. Use assertEquals(1, settingsByIndex.size()).

6. IT hermeticity: snapshot/restore SearchIndexingApplication's
   appConfiguration. /v1/apps/trigger/{name} merges payload into the
   persisted config, so without restore later tests in the suite
   inherit this test's bulkIndexSettings / liveIndexSettings /
   useDistributedIndexing values — making suite ordering change what
   they exercise. Both IT methods now do a try/finally with
   snapshotAppConfig() + restoreAppConfig().
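Finding 1+4's positive-evidence rule can be sketched as a small decision helper (hypothetical names; a null probe result models "probe failed / ambiguous"):

```java
// Hypothetical sketch: canonicalDeleted only flips to true on positive
// evidence (delete call succeeded AND a post-state probe confirms the index
// is gone). An ambiguous probe keeps it false, so data loss is never
// claimed off a probe that merely failed.
class DataLossClassifierSketch {
  static boolean canonicalDeleted(boolean deleteCallSucceeded, Boolean probeSaysGone) {
    // probeSaysGone == null models "probe failed / ambiguous"
    return deleteCallSucceeded && Boolean.TRUE.equals(probeSaysGone);
  }

  static boolean classifyDataLoss(boolean aliasMissing, boolean canonicalDeleted) {
    // Never claim data loss without positive evidence canonical was deleted.
    return aliasMissing && canonicalDeleted;
  }
}
```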

All 151 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

harshach and others added 2 commits May 5, 2026 21:54
restoreAppConfig POSTs to /v1/apps/trigger/{name}, which both merges the
body into the persisted config AND starts a new run. Returning without
waiting left SearchIndexingApplication running into the next test class
(AppsResourceIT.test_triggerApp_200), which then timed out for 2 minutes
on "already running".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The terminal-state check compared status against uppercase "SUCCESS",
"FAILED", "COMPLETED", but appRunRecord.json defines the status enum
in lowercase ("success", "failed", "completed", ...). The check never
matched and the 30s wait silently fell through to the catch block,
making it a no-op. test_triggerApp_200 then relied on its 2-minute
"already running" trigger retry, which timed out whenever a longer
reindex (e.g. SearchIndexingFieldsParityIT's "all entities" reindex)
was still in flight.

Switch the terminal check to "not running and not started"
case-insensitively, and raise the ceiling to 5 minutes so the wait
actually covers a long in-flight reindex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
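The corrected wait condition, as a minimal sketch (the method name is illustrative): treat any status that is neither "running" nor "started", compared case-insensitively, as terminal, so the lowercase enum values from appRunRecord.json match.

```java
import java.util.Locale;

// Hypothetical sketch of the fixed terminal-state check.
class AppRunStatusSketch {
  static boolean isTerminal(String status) {
    if (status == null) return false; // no record yet: keep waiting
    String s = status.toLowerCase(Locale.ROOT);
    return !s.equals("running") && !s.equals("started");
  }
}
```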
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment on lines 134 to 138:

      if (canonicalIndex == null || stagedIndex == null) {
        LOG.error(
    -       "Cannot finalize reindex for entity '{}'. canonicalIndex={}, stagedIndex={}",
    +       "Cannot promote index for entity '{}'. canonicalIndex={}, stagedIndex={}",
            entityType,
            canonicalIndex,
mohityadav766 and others added 4 commits May 6, 2026 12:28
…-promote

# Conflicts:
#	openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java
#	openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java
…ata/OpenMetadata into harshach/search-alias-promote
The test triggers SearchIndexingApplication and waits for it to
complete, but during the parallel-tests slot other classes can also
trigger the same app concurrently. The resulting in-flight run
then leaks into AppsResourceIT.test_triggerApp_200 (which runs in
the sequential slot) and exhausts its 2-minute "already running"
trigger Awaitility.

AppsResourceIT is already in the sequential bucket for the same
reason. Mirror it for SearchIndexAliasPromotionIT across all seven
failsafe profiles (mysql/postgres × elasticsearch/opensearch and the
RDF profile) — include in sequential-tests, exclude from
parallel-tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.

harshach and others added 2 commits May 6, 2026 06:51
DefaultRecreateHandler.promote: when canonicalIndex is null but
stagedIndex is non-null, the early-return previously skipped the
finally clause that unregisters the staged index. Live writes could
stay routed to the staged index after we bail. Release the
registration explicitly before returning. Extend the existing
testPromoteWithNullCanonicalIndex unit test to assert the unregister
call.

SearchIndexAliasPromotionIT.snapshotAppConfig: distinguish "snapshot
failed" from "config absent". The previous Map.of() return on
exception caused restoreAppConfig to POST an empty body to
/v1/apps/trigger/{name}, which is a no-op merge that silently leaks
this test's bulk/live setting overrides into downstream tests and
starts a spurious app run. Return null on failure so the caller
short-circuits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
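The snapshot-failure distinction, as a minimal sketch (names are illustrative): return null on failure so the caller skips restore entirely, instead of Map.of(), which restoreAppConfig would otherwise POST as an empty-body merge that leaks overrides and starts a spurious run.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of snapshotAppConfig's fixed error handling.
class AppConfigSnapshotSketch {
  static Map<String, Object> snapshotAppConfig(Supplier<Map<String, Object>> fetch) {
    try {
      return fetch.get(); // may legitimately be an empty config
    } catch (Exception e) {
      return null; // null == "snapshot failed", distinct from "config absent/empty"
    }
  }
}
```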
The IT triggered the bundled SearchIndexingApplication, which made
it expensive (full reindex per run, ~1-2 minutes) and a source of
bleed-through into AppsResourceIT.test_triggerApp_200 even after
moving to the sequential failsafe bucket.

The same surface is already covered by unit tests:
- DefaultRecreateHandlerTest "Should restore live serving settings on
  staged index before alias swap" (and the per-step swap-order,
  data-loss-flagging, post-state-check tests)
- DistributedSearchIndexExecutorTest verifies the
  withJobData(jobConfig) wiring on the per-entity promotion handler

Drop the IT and revert its pom.xml include/exclude entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

harshach and others added 2 commits May 6, 2026 07:09
DefaultRecreateHandler.promote: validate aliasesToAttach BEFORE
applyLiveServingSettings/maybeForceMerge so the empty-aliases error
path skips wasted I/O and segment churn on a staged index that will
never be swapped in. Extend testPromoteWithEmptyContextAliases to
verify updateIndexSettings and forceMerge are not invoked. Adjust
testPromoteEntityIndexUnregistersStagedIndexOnSettingsFailure to
populate canonicalAliases so it still reaches applyLiveServingSettings.

Refresh checkAliasAttached Javadoc: with positive-evidence
canonicalDeleted tracking, a null result is classified as data loss
only when canonicalDeleted=true; otherwise it is degraded (retryable).
The previous wording claimed every null was data loss, which no longer
matches the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On every reindex after the first, the canonical name (e.g.
openmetadata_table_search_index) is an alias on the previous staged
index — not a concrete index. The three-step swap then attempted
deleteIndexWithBackoff(canonicalAlias), which OS/ES rejects with
illegal_argument_exception ("matches an alias, specify the
corresponding concrete indices instead"). The 6-attempt exponential
backoff burned ~31s per entity. With 30+ default entity types the
SearchIndexingApplication ran past CI's 300s setup budget, blocking
every Playwright shard and python-integration setup at "Setup
OpenMetadata Test Environment".

Two-pronged fix in DefaultRecreateHandler.promote:

1. Detect canonicalIsAlias via getIndicesByAlias(canonicalIndex).
   When true, take the new promoteByAtomicAliasSwap path: a single
   swapAliases call atomically moves every alias (parents + canonical)
   from old → new staged. No name collision, no per-entity
   delete-by-alias-name, no degraded/data-loss windows.

2. listIndicesByPrefix returns the canonical alias name itself among
   its results (alongside concrete *_rebuild_* indices). Filter that
   out of oldIndicesToCleanup when canonicalIsAlias, so the cleanup
   loop's deleteIndexWithBackoff doesn't replay the same 31s backoff.
   Keep canonical in cleanup when it is concrete — that's where the
   first-reindex flow drops the original concrete after the three-step
   swap moves aliases off it.

Local repro: SearchIndexingApplication now completes in ~7s instead
of hanging past 300s. New unit test
testPromoteEntityIndexAtomicSwapWhenCanonicalIsAlias locks the new
shape (single swap, no delete-by-alias, old concrete cleaned up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
harshach and others added 4 commits May 6, 2026 10:59
Local repro on the CI-equivalent stack (mysql + elasticsearch + sample
data + ingestion) completes in ~30s with all 60 entities going through
the atomic-swap path. CI on the same commit still hangs past 300s.
Add a structured log line at the top of every promote() so the next
CI run shows which entities reach promotion and what shape (atomic
vs three-step) was selected — pinpoints whether a specific entity
gets stuck or if reindex never reaches promote().

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… main

The original "live settings not restored on staged after promote"
regression is fixed by PR #27930 (commit e56abb8), which is already
on main: applyLiveServingSettings + maybeForceMerge run before alias
swap, and DistributedSearchIndexExecutor wires job configuration into
the per-entity DefaultRecreateHandler. That minimal fix is sufficient
for the original symptom (newly created entities not searchable until
manual _refresh).

This commit reverts every file we refactored further on top of that
minimal fix back to origin/main:

  - DefaultRecreateHandler.java         — drop deferred-canonical-delete
                                           three-step swap, post-state
                                           checks, dataLoss tracking,
                                           safeIndexExists/checkIndexExists/
                                           checkAliasAttached helpers,
                                           promoteByAtomicAliasSwap path,
                                           ALIAS_PROMOTE_BEGIN diagnostic
  - RecreateIndexHandler.java           — drop getFailedPromotions /
                                           getDataLossPromotions defaults
  - DistributedIndexingStrategy.java    — drop FAILED-status emit and
                                           dataLoss aggregation
  - SearchIndexExecutor.java            — drop FAILED-status emit
  - DistributedSearchIndexExecutor.java — drop @Getter on
                                           recreateIndexHandler
  - The matching unit tests in DefaultRecreateHandlerTest /
    DistributedIndexingStrategyTest / SearchIndexExecutorControlFlowTest /
    DistributedSearchIndexExecutorTest all revert to origin/main.

The branch's only remaining contribution is the AppsResourceIT
case-mismatch fix in waitForAppJobCompletion (a pre-existing bug
discovered while diagnosing).

Reason: the further refactor has consistently failed CI on this branch
since the first commit. Local repro on the CI-equivalent stack
(./docker/run_local_docker.sh -d mysql -m no-ui -s true -i true) showed
the canonical-is-alias fix working end-to-end (~30s, 60 entities), but
CI still timed out at 300s. Without server-side logs from CI we can't
target the remaining gap. PR #27930 already addresses the user-visible
regression, so the safest move is to drop the further refactor and ship
just the case-mismatch fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the first reindex, the canonical name (e.g. table_search_index)
is an alias on the previous staged, not a concrete index. OpenSearch's
listIndicesByPrefix returns the alias name as one of its result keys,
which then drives a deleteIndexWithBackoff(canonicalIndex) attempt
that fails with "[illegal_argument_exception] matches an alias,
specify the corresponding concrete indices instead". The 6-attempt
exponential backoff burns ~31s per entity (1+2+4+8+16s); on a 60-entity
reindex that wastes ~30 minutes in cleanup with search degraded
throughout.
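The quoted backoff cost checks out arithmetically; a minimal sketch of the sum (the helper name is illustrative, not from the codebase):

```java
// Six delete attempts with exponential waits of 1, 2, 4, 8, 16 seconds
// between them sleep 31s per entity; at 60 entities that is ~31 minutes
// of cleanup backoff.
class BackoffMathSketch {
  static long sleepSeconds(int attempts) {
    long total = 0;
    for (int i = 0; i < attempts - 1; i++) {
      total += 1L << i; // waits of 1, 2, 4, 8, 16, ... seconds
    }
    return total;
  }
}
```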

Drop the alias name from oldIndicesToDelete when getIndicesByAlias
proves canonical is an alias right now. The atomic swapAliases call
moves the canonical alias from the old concrete to the new staged in
one step; the underlying old concrete is already in oldIndicesToDelete
and gets cleaned up normally by the post-swap loop. No three-step swap
or deferred-canonical-delete restructure needed.
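The guard described above, as a minimal sketch (the helper name is hypothetical): when the canonical name currently resolves as an alias, drop it from the cleanup list so deletes only ever run against concrete indices.

```java
import java.util.List;

// Hypothetical sketch of the cleanup-list guard. Keeps the canonical name
// in the list when it is a concrete index (first-reindex flow), filters it
// out when it is an alias (every subsequent reindex).
class CleanupFilterSketch {
  static List<String> oldIndicesToDelete(
      List<String> indicesByPrefix, String canonicalIndex, boolean canonicalIsAlias) {
    return indicesByPrefix.stream()
        .filter(name -> !(canonicalIsAlias && name.equals(canonicalIndex)))
        .toList();
  }
}
```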

Adjusts testFinalizeReindexPromotesPartialData to use a realistic
canonical-concrete setup (no self-alias on its own name — that state
cannot exist in real OS/ES) so the new guard does not misfire on the
existing test fixture.

New unit test testFinalizeReindexSkipsDeleteWhenCanonicalIsAlias locks
the new behavior: when canonical is an alias, deleteIndexWithBackoff
is never called with the canonical name, and the old concrete rebuild
is cleaned up via the swap path.

Verified locally on the CI-equivalent stack
(./docker/run_local_docker.sh -d mysql -m no-ui -s true -i true): both
first reindex (canonical-concrete) and second reindex (canonical-alias)
complete in ~8s with no "matches an alias" errors and clean cleanup
logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gitar-bot

gitar-bot Bot commented May 6, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Restores live settings on the per-entity promote path by correctly applying index settings and wiring job configurations, ensuring indices become searchable post-promotion. Addresses several issues including unhandled exceptions during promotion, missing cleanup of REST clients in tests, and incorrect alias swap logic.

✅ 4 resolved
Edge Case: applyLiveServingSettings/maybeForceMerge exception skips unregisterStagedIndex

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:330-331 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:334 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:408-409
Lines 330-331 in promoteEntityIndex call applyLiveServingSettings and maybeForceMerge outside the try/catch/finally block (lines 334-410). If either throws (e.g., Elasticsearch connectivity issue during updateIndexSettings), the finally at line 408 that calls searchRepository.unregisterStagedIndex(entityType, stagedIndex) is never reached. This leaves the staged index permanently registered, potentially routing live writes to an un-promoted index that nothing reads from.

This mirrors the same pattern in finalizeReindex (lines 139-140 are also outside its try block), so it's a pre-existing design choice being replicated here. However, since the per-entity path runs mid-job (not at terminal cleanup), the window for a transient ES failure is wider.

Quality: Rest5Client created but never closed in integration test

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchIndexAliasPromotionIT.java:250
In readIndexSettings, TestSuiteBootstrap.createSearchClient() returns a Rest5Client that is never closed. Over repeated test runs or if the test is retried, this leaks HTTP connections to Elasticsearch. Use try-with-resources or close in a finally block.

Edge Case: Post-state checks can throw unhandled, leaving promotion in limbo

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:344 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:371
The new post-state verification calls at lines 344 and 371 (searchClient.indexExists(canonicalIndex) and searchClient.getAliases(stagedIndex)) are outside any try-catch block. If these calls throw (network timeout, transport error), the exception propagates unhandled from promoteWithDeferredCanonicalDelete.

The getAliases call at line 371 is especially dangerous: at that point the canonical index has already been deleted. An unhandled exception means:

  • The canonical index is gone (data loss state)
  • markPromotionFailed is never called, so the failure isn't recorded
  • The alias may or may not have been attached

Similarly, if indexExists at line 344 throws after deleteIndexWithBackoff succeeded, we never proceed to add the canonical alias.

Bug: promoteByAtomicAliasSwap hardcodes reindexSuccess=true

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:338-346 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:476 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:496
When canonicalIsAlias is true, promoteWithDeferredCanonicalDelete delegates to promoteByAtomicAliasSwap (line 338-345) without forwarding the reindexSuccess parameter. promoteByAtomicAliasSwap then hardcodes true at line 496 when calling recordPromotionSuccessAndCleanupOldIndices. This means partial-data promotions (reindexSuccess=false with docCount>0) on re-reindexed entities will be logged as reindexSuccess: true, producing misleading operational logs. The functional impact is low since reindexSuccess is only used in the log message, but it makes debugging harder when investigating partial-data scenarios.


@sonarqubecloud

sonarqubecloud Bot commented May 6, 2026

@github-actions
Contributor

github-actions Bot commented May 6, 2026

🟡 Playwright Results — all passed (10 flaky)

✅ 4002 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 86 skipped

| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🟡 Shard 1 | 298 | 0 | 1 | 4 |
| 🟡 Shard 2 | 747 | 0 | 5 | 8 |
| 🟡 Shard 3 | 758 | 0 | 3 | 7 |
| ✅ Shard 4 | 775 | 0 | 0 | 18 |
| ✅ Shard 5 | 687 | 0 | 0 | 41 |
| 🟡 Shard 6 | 737 | 0 | 1 | 8 |
🟡 10 flaky test(s) (passed on retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataProductRename.spec.ts › should handle multiple consecutive renames and preserve assets (shard 2, 1 retry)
  • Features/DataQuality/BundleSuiteBulkOperations.spec.ts › Add test case to existing Bundle Suite (shard 2, 1 retry)
  • Features/EntitySummaryPanel.spec.ts › should display summary panel for tableColumn (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Tags term should be consistent for search (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace


Labels: backend, safe to test

3 participants