fix(search): restore live settings on per-entity promote path #27920
Conversation
DefaultRecreateHandler exposes two finalization paths:
- finalizeReindex(...) — centralized end-of-job promotion. Calls
applyLiveServingSettings + maybeForceMerge
before the alias swap, reverting the bulk
overrides (refresh_interval=-1, replicas=0,
async translog) back to live values
(refresh=1s, replicas=1, durable translog).
- promoteEntityIndex(ctx, ok) — per-entity promotion. Used by the distributed
search-indexer's "promote as soon as all
partitions for an entity complete" callback
(DistributedSearchIndexExecutor.promoteEntityIndex).
Swaps the alias and cleans up old indices —
but never restores live settings. (The two
settings profiles involved are sketched below.)
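For reference, a minimal sketch of the two settings profiles these paths toggle between, using the values named above. The `index.`-prefixed keys are the standard OS/ES settings names; the class itself is illustrative, not code from this PR.

```java
import java.util.Map;

/** Reference sketch of the bulk-build vs. live-serving settings profiles. */
final class IndexSettingsProfiles {
  // Bulk-build overrides applied while reindexing (write-optimized).
  static final Map<String, Object> BULK = Map.of(
      "index.refresh_interval", "-1",         // disable refresh during bulk writes
      "index.number_of_replicas", 0,          // no replication while rebuilding
      "index.translog.durability", "async");  // fsync the translog asynchronously

  // Live serving values finalizeReindex reverts to before the alias swap.
  static final Map<String, Object> LIVE = Map.of(
      "index.refresh_interval", "1s",            // make writes searchable again
      "index.number_of_replicas", 1,
      "index.translog.durability", "request");   // durable per-request fsync
}
```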
When an entity finishes its partitions before the final reconciliation
(typically the smallest entities — e.g. knowledge `page` with ~11 rows),
its index is promoted via the per-entity path, the alias swap succeeds,
and the bulk-build overrides become the new live settings. refresh_interval
stays at -1 in production, so live writes after the reindex are buffered in
the translog and never reach searchable segments until a manual _refresh.
Externally this surfaces as "create an article, hierarchy is empty until I
re-trigger reindex" — exactly the user-reported bug.
Mirror the finalizeReindex sequence by calling applyLiveServingSettings
(and maybeForceMerge for parity) at the top of the promote block in
promoteEntityIndex, before the alias swap.
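A minimal sketch of the resulting order inside the promote block. The method names are the ones this PR discusses; the simplified parameters and stub bodies are illustrative, not the real signatures (the actual method takes a context object).

```java
/** Sketch of the per-entity promote order after the fix; bodies are stubs. */
class PromoteEntityIndexSketch {
  void promoteEntityIndex(String stagedIndex, String canonicalIndex, boolean ok) {
    if (!ok) {
      return; // partitions incomplete or doc-count gate failed; nothing to promote
    }
    applyLiveServingSettings(stagedIndex); // revert bulk overrides FIRST
    maybeForceMerge(stagedIndex);          // optional, for parity with finalizeReindex
    swapAliases(canonicalIndex, stagedIndex); // only now make the staged index live
    cleanupOldIndices(canonicalIndex);        // then drop superseded rebuild indices
  }

  void applyLiveServingSettings(String index) { /* PUT _settings with live values */ }
  void maybeForceMerge(String index) { /* POST _forcemerge when configured */ }
  void swapAliases(String canonical, String staged) { /* atomic alias move */ }
  void cleanupOldIndices(String canonical) { /* delete old *_rebuild_* indices */ }
}
```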
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DefaultRecreateHandler.applyLiveServingSettings reads from the handler's jobData field (live + bulk index-settings overrides on the EventPublisherJob). The per-entity distributed-promotion path in DistributedSearchIndexExecutor created its own DefaultRecreateHandler instance and never called withJobData(jobData) on it. With jobData=null, buildRevertJson returns null and applyLiveServingSettings silently no-ops — meaning the previous fix (b272de8) never actually re-applied live settings on the per-entity promote path, even though the call was reached.

currentJob.getJobConfiguration() is the EventPublisherJob the strategy created. Wire it into the new handler at construction time, mirroring the withJobData call DistributedIndexingStrategy already makes on the strategy's own handler instance.

With this change, the per-entity promote path now logs "Applying live serving settings to staged index '...' for entity 'page': {\"number_of_replicas\":1,\"refresh_interval\":\"1s\", ...}" before the alias swap, and post-promotion `_settings` show refresh_interval=1s instead of the stuck -1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
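The wiring change, sketched with stubbed types. Only the withJobData call and the jobData field reflect this commit; everything else is scaffolding for illustration.

```java
/** Sketch of the wiring fix; types are stubbed so the shape is self-contained. */
class WiringSketch {
  interface JobData {}

  static class DefaultRecreateHandler {
    JobData jobData; // applyLiveServingSettings silently no-ops while this is null
    DefaultRecreateHandler withJobData(JobData jd) { this.jobData = jd; return this; }
  }

  DefaultRecreateHandler buildPerEntityHandler(JobData jobConfiguration) {
    // Before the fix: new DefaultRecreateHandler() with jobData left null,
    // so buildRevertJson returned null and no settings were ever reverted.
    return new DefaultRecreateHandler().withJobData(jobConfiguration);
  }
}
```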
Asserts that DefaultRecreateHandler.promoteEntityIndex calls searchClient.updateIndexSettings with the live-revert JSON (refresh_interval=1s, replicas=1, translog.durability=request) before swapping the alias, given a handler with bulk overrides wired through withJobData.

Without the two preceding fixes the assertion fails with "Wanted but not invoked" — applyLiveServingSettings was never reached on the per-entity promotion path, so the staged index inherited refresh_interval=-1 and post-promotion live writes never became searchable until a manual _refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Fixes the distributed per-entity search-index promotion path so a promoted staged index is returned to live-serving settings before it starts serving reads/writes. This sits in the search reindexing flow and aligns the per-entity distributed promote path with the existing full finalizeReindex behavior.
Changes:
- Restores live-serving index settings and optional force-merge in promoteEntityIndex before alias promotion.
- Passes distributed job configuration into the recreate handler so per-entity promotion can resolve bulk/live index settings.
- Adds a regression test around DefaultRecreateHandler.promoteEntityIndex.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java | Adds live-settings restore + force-merge to the per-entity promotion path. |
| openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java | Wires currentJob configuration into the recreate handler used by distributed per-entity promotion. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/DefaultRecreateHandlerTest.java | Adds a regression test for live-settings restoration during per-entity promotion. |
DefaultRecreateHandlerTest.PromoteEntityIndexTests:
- testPromoteEntityIndexAppliesSettingsBeforeAliasSwap: InOrder
  verification that updateIndexSettings runs BEFORE swapAliases (sketched
  after this list). Catches a swap-then-revert misordering (which would
  briefly serve live reads against refresh=-1 settings).
- testPromoteEntityIndexForceMergesWhenConfigured: forceMerge(staged, 1) is
invoked when bulkIndexSettings.forceMergeOnPromote=true. Catches a
regression where the force-merge call gets dropped without anyone
noticing.
- testPromoteEntityIndexSkipsSettingsWithoutJobData: locks in the safe no-op
behavior when a handler is constructed without withJobData. Documents that
no-jobData → no settings call (vs. crash or silent revert to defaults).
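A minimal Mockito shape for the ordering verification in the first test above. The SearchClient interface and the literal arguments are stand-ins for the real fixture, which builds a full handler with bulk overrides wired through withJobData.

```java
import static org.mockito.Mockito.*;

import org.mockito.InOrder;

/** Sketch of the InOrder check; SearchClient stands in for the real client mock. */
class InOrderSketchTest {
  interface SearchClient {
    void updateIndexSettings(String index, String settingsJson);
    void swapAliases(String from, String to);
  }

  void verifySettingsBeforeSwap(SearchClient searchClient) {
    InOrder inOrder = inOrder(searchClient);
    // The settings revert must be observed first...
    inOrder.verify(searchClient)
        .updateIndexSettings(eq("staged_index"), contains("\"refresh_interval\":\"1s\""));
    // ...and only then the alias swap. Two independent verify() calls would
    // still pass if a refactor swapped first and reverted settings afterwards.
    inOrder.verify(searchClient).swapAliases(anyString(), anyString());
  }
}
```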
DistributedSearchIndexExecutorTest:
- initializeEntityTrackerWiresJobDataIntoDefaultRecreateHandler: triggers
the private initializeEntityTracker with currentJob holding a populated
jobConfiguration and verifies recreateHandler.withJobData(jobData) is
called on the per-entity handler. This catches the second half of the
original regression: even if applyLiveServingSettings is reached on
promoteEntityIndex, jobData=null makes it a silent no-op. Future edits
that drop the wiring or move handler construction elsewhere will fail
here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Triggers SearchIndexingApplication with bulkIndexSettings configured (refresh_interval=-1, number_of_replicas=0, translog.durability=async), waits for the run to terminate, then queries _settings on the promoted table_search_index alias against the real OpenSearch/Elasticsearch container (via TestSuiteBootstrap.createSearchClient()). Asserts that each concrete index resolved by the alias has the live values applied (refresh=1s, replicas=1, translog.durability=request) and not the bulk overrides.

This is the end-to-end counterpart to the unit-level regression test in DefaultRecreateHandlerTest. It catches the same class of bug at the layer where it actually surfaced in production: an alias swap that completed successfully according to logs but left the new live index unsearchable because refresh was disabled and writes were buffered indefinitely.

Modeled on SearchIndexingFieldsParityIT for run-trigger / poll structure; adds the post-completion _settings verification step that no other IT performs today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
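The post-completion assertion at the heart of this IT, sketched under the assumption that settings arrive as a flattened map per concrete index; the real test resolves them from the running container.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Map;

/** Sketch of the post-completion check; the flattened-settings shape is assumed. */
class AliasSettingsAssertionSketch {
  // settingsByIndex: concrete index name -> flattened settings for that index.
  void assertLiveSettings(Map<String, Map<String, String>> settingsByIndex) {
    for (Map<String, String> settings : settingsByIndex.values()) {
      assertEquals("1s", settings.get("index.refresh_interval"));          // not -1
      assertEquals("1", settings.get("index.number_of_replicas"));         // not 0
      assertEquals("request", settings.get("index.translog.durability"));  // not async
    }
  }
}
```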
…dant test
DefaultRecreateHandler: move applyLiveServingSettings + maybeForceMerge
inside the try/finally that unregisters the staged index. Without this, a
transient OS/ES failure during _settings update or _forcemerge propagated out
before the finally ran, leaving searchRepository.unregisterStagedIndex
permanently registered — so live writes kept routing to a staged index
nothing reads from. Same fix applied to finalizeReindex for consistency
(its window is shorter since it runs at end-of-job, but the leak shape is
identical). Per gitar-bot review.
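The resulting shape, sketched with stubbed collaborators. Only the try/finally placement reflects the fix; the signatures are illustrative.

```java
/** Sketch of the try/finally fix: the finally must run even if settings I/O throws. */
class UnregisterOnFailureSketch {
  interface SearchRepository { void unregisterStagedIndex(String entityType); }

  private final SearchRepository searchRepository;
  UnregisterOnFailureSketch(SearchRepository repo) { this.searchRepository = repo; }

  void promote(String entityType, String staged, String canonical) {
    try {
      applyLiveServingSettings(staged); // may throw on a transient OS/ES failure
      maybeForceMerge(staged);          // ditto
      swapAliases(canonical, staged);
    } finally {
      // Always release the registration; otherwise live writes keep routing
      // to a staged index that nothing reads from.
      searchRepository.unregisterStagedIndex(entityType);
    }
  }

  void applyLiveServingSettings(String index) {}
  void maybeForceMerge(String index) {}
  void swapAliases(String canonical, String staged) {}
}
```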
DefaultRecreateHandlerTest:
- testPromoteEntityIndexAppliesLiveServingSettingsBeforeSwap: replace
  independent verify() calls with InOrder so the test actually locks in
  the "settings before alias swap" ordering that its name and the PR
  description promise. A swap-then-revert refactor would have passed
  before this.
Per Copilot review.
- Drop testPromoteEntityIndexAppliesSettingsBeforeAliasSwap (the standalone
InOrder test added in the previous commit) — folded back into the test
above, which now covers both ordering and JSON content in one place.
- Add testPromoteEntityIndexUnregistersStagedIndexOnSettingsFailure —
regression test for the gitar-bot fix above. Verified to fail with
"IllegalState connection reset" when the calls are moved back outside
the try block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The why-it's-in-a-try and why-jobData-is-wired comment blocks read like commit messages, not code annotations. Tests and commit history carry the rationale; the code itself reads fine without them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ic IT
Six findings from copilot-pull-request-reviewer.
1+4. Track canonicalDeleted via positive evidence only (DefaultRecreateHandler).
ES/OS indexExists/getAliases/listIndicesByPrefix all swallow transport
errors and return false/empty. A "negative" probe result cannot be
distinguished from "probe failed". The previous shape blindly trusted
probe values and could misclassify a transient failure as data loss
when canonical was actually still serving (gate-says-no-but-actually-yes
case) or, conversely, as not-data-loss when canonical was actually
gone. Now: a `canonicalDeleted` boolean defaults to false and only
flips to true after both the delete call and a positive post-state
check confirm the index is gone. dataLoss classification uses this
flag — never claim data loss without positive evidence canonical was
deleted. Added regression test
`testPromoteEntityIndexAmbiguousGateProbeIsNotDataLoss` for the gate-
ambiguity case; a minimal sketch of the canonicalDeleted flag follows
this commit message.
2. Wire onJobFailed in DistributedIndexingStrategy.
Previously only SearchIndexExecutor emitted the listener callback for
FAILED. The distributed strategy returned the status but never invoked
listeners.onJobFailed, so jobData.failure stayed empty and the
AppRunRecord/WebSocket update had no failure context. Now mirrors the
single-server behavior with an IllegalStateException naming the
data-loss entities.
3. IT: assertEquals("request", durability) instead of assertNotEquals
("async"). The non-equals assertion would pass if a silent translog-
revert drop left durability at any non-async cluster default, missing
the regression. Pin the exact configured value.
5. IT: assert exactly one concrete index resolves the alias, not
at-least-one. A broken swap that leaves the alias attached to BOTH
the pre-existing live index AND the new *_rebuild_* index would
satisfy "any rebuild present" but duplicate search results in
production. Use assertEquals(1, settingsByIndex.size()).
6. IT hermeticity: snapshot/restore SearchIndexingApplication's
appConfiguration. /v1/apps/trigger/{name} merges payload into the
persisted config, so without restore later tests in the suite
inherit this test's bulkIndexSettings / liveIndexSettings /
useDistributedIndexing values — making suite ordering change what
they exercise. Both IT methods now do a try/finally with
snapshotAppConfig() + restoreAppConfig().
All 151 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
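A minimal sketch of the positive-evidence canonicalDeleted tracking from finding 1+4 above. The SearchClient interface is a stand-in; only the flag's default-false, flip-on-confirmation shape comes from the commit.

```java
/** Sketch of positive-evidence deletion tracking; SearchClient is a stand-in. */
class CanonicalDeletedSketch {
  interface SearchClient {
    void deleteIndex(String index);    // may throw on transport failure
    boolean indexExists(String index); // swallows transport errors, returns false
  }

  boolean deleteWithPositiveEvidence(SearchClient client, String canonical) {
    boolean canonicalDeleted = false; // default: make no claim of deletion
    try {
      client.deleteIndex(canonical); // step 1: the delete call itself succeeds
      if (!client.indexExists(canonical)) {
        canonicalDeleted = true; // step 2: the post-state check agrees it is gone
      }
    } catch (RuntimeException e) {
      // Transient failure: leave the flag false. Data loss is never reported
      // without positive evidence that the canonical index was deleted.
    }
    return canonicalDeleted;
  }
}
```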
restoreAppConfig POSTs to /v1/apps/trigger/{name}, which both merges the
body into the persisted config AND starts a new run. Returning without
waiting left SearchIndexingApplication running into the next test class
(AppsResourceIT.test_triggerApp_200), which then timed out for 2 minutes
on "already running".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The terminal-state check compared status against uppercase "SUCCESS",
"FAILED", "COMPLETED", but appRunRecord.json defines the status enum
in lowercase ("success", "failed", "completed", ...). The check never
matched and the 30s wait silently fell through to the catch block,
making it a no-op. test_triggerApp_200 then relied on its 2-minute
"already running" trigger retry, which timed out whenever a longer
reindex (e.g. SearchIndexingFieldsParityIT's "all entities" reindex)
was still in flight.
Switch the terminal check to "not running and not started"
case-insensitively, and raise the ceiling to 5 minutes so the wait
actually covers a long in-flight reindex.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
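The corrected terminal check, sketched. The status strings are the lowercase enum values from appRunRecord.json noted above; the helper itself is illustrative.

```java
/** Sketch of the corrected wait condition; statuses come from appRunRecord.json. */
class AppRunStatusSketch {
  // The enum values are lowercase ("success", "failed", "completed", "running", ...),
  // so compare case-insensitively and key on "no longer running/started" rather
  // than an exact-match list of success states.
  static boolean isTerminal(String status) {
    return status != null
        && !"running".equalsIgnoreCase(status)
        && !"started".equalsIgnoreCase(status);
  }
}
```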
```diff
  if (canonicalIndex == null || stagedIndex == null) {
    LOG.error(
-       "Cannot finalize reindex for entity '{}'. canonicalIndex={}, stagedIndex={}",
+       "Cannot promote index for entity '{}'. canonicalIndex={}, stagedIndex={}",
        entityType,
        canonicalIndex,
```
…-promote

Conflicts:
- openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java
- openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java
…ata/OpenMetadata into harshach/search-alias-promote
The test triggers SearchIndexingApplication and waits for it to complete, but during the parallel-tests slot other classes can also trigger the same app concurrently. The resulting in-flight run then leaks into AppsResourceIT.test_triggerApp_200 (which runs in the sequential slot) and exhausts its 2-minute "already running" trigger Awaitility.

AppsResourceIT is already in the sequential bucket for the same reason. Mirror it for SearchIndexAliasPromotionIT across all seven failsafe profiles (mysql/postgres × elasticsearch/opensearch and the RDF profile) — include in sequential-tests, exclude from parallel-tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DefaultRecreateHandler.promote: when canonicalIndex is null but
stagedIndex is non-null, the early-return previously skipped the
finally clause that unregisters the staged index. Live writes could
stay routed to the staged index after we bail. Release the
registration explicitly before returning. Extend the existing
testPromoteWithNullCanonicalIndex unit test to assert the unregister
call.
SearchIndexAliasPromotionIT.snapshotAppConfig: distinguish "snapshot
failed" from "config absent". The previous Map.of() return on
exception caused restoreAppConfig to POST an empty body to
/v1/apps/trigger/{name}, which is a no-op merge that silently leaks
this test's bulk/live setting overrides into downstream tests and
starts a spurious app run. Return null on failure so the caller
short-circuits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
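The snapshot/restore contract described above, sketched with hypothetical helpers: fetchPersistedAppConfig, postTrigger, and waitForRunToFinish are illustrative names, not the test's real methods.

```java
import java.util.Map;

/** Sketch of the snapshot/restore contract; helper names are hypothetical. */
class AppConfigSnapshotSketch {
  /** Returns null when the snapshot itself failed, so restore short-circuits. */
  Map<String, Object> snapshotAppConfig() {
    try {
      return fetchPersistedAppConfig(); // hypothetical read of the persisted config
    } catch (Exception e) {
      return null; // "snapshot failed" — distinct from an empty-but-valid config
    }
  }

  void restoreAppConfig(Map<String, Object> snapshot) {
    if (snapshot == null) {
      return; // never POST an empty body: trigger merges the config AND starts a run
    }
    postTrigger(snapshot);  // hypothetical POST to /v1/apps/trigger/{name}
    waitForRunToFinish();   // avoid leaking an in-flight run into the next class
  }

  Map<String, Object> fetchPersistedAppConfig() { return Map.of(); }
  void postTrigger(Map<String, Object> body) {}
  void waitForRunToFinish() {}
}
```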
The IT triggered the bundled SearchIndexingApplication, which made it expensive (full reindex per run, ~1-2 minutes) and a source of bleed-through into AppsResourceIT.test_triggerApp_200 even after moving to the sequential failsafe bucket.

The same surface is already covered by unit tests:
- DefaultRecreateHandlerTest "Should restore live serving settings on staged index before alias swap" (and the per-step swap-order, data-loss-flagging, and post-state-check tests)
- DistributedSearchIndexExecutorTest verifies the withJobData(jobConfig) wiring on the per-entity promotion handler

Drop the IT and revert its pom.xml include/exclude entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DefaultRecreateHandler.promote: validate aliasesToAttach BEFORE applyLiveServingSettings/maybeForceMerge so the empty-aliases error path skips wasted I/O and segment churn on a staged index that will never be swapped in. Extend testPromoteWithEmptyContextAliases to verify updateIndexSettings and forceMerge are not invoked. Adjust testPromoteEntityIndexUnregistersStagedIndexOnSettingsFailure to populate canonicalAliases so it still reaches applyLiveServingSettings.

Refresh the checkAliasAttached Javadoc: with positive-evidence canonicalDeleted tracking, a null result is classified as data loss only when canonicalDeleted=true; otherwise it is degraded (retryable). The previous wording claimed every null was data loss, which no longer matches the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On every reindex after the first, the canonical name (e.g.
openmetadata_table_search_index) is an alias on the previous staged
index — not a concrete index. The three-step swap then attempted
deleteIndexWithBackoff(canonicalAlias), which OS/ES rejects with
illegal_argument_exception ("matches an alias, specify the
corresponding concrete indices instead"). The 6-attempt exponential
backoff burned ~31s per entity. With 30+ default entity types the
SearchIndexingApplication ran past CI's 300s setup budget, blocking
every Playwright shard and python-integration setup at "Setup
OpenMetadata Test Environment".
Two-pronged fix in DefaultRecreateHandler.promote:
1. Detect canonicalIsAlias via getIndicesByAlias(canonicalIndex).
When true, take the new promoteByAtomicAliasSwap path: a single
swapAliases call atomically moves every alias (parents + canonical)
from old → new staged. No name collision, no per-entity
delete-by-alias-name, no degraded/data-loss windows.
2. listIndicesByPrefix returns the canonical alias name itself among
its results (alongside concrete *_rebuild_* indices). Filter that
out of oldIndicesToCleanup when canonicalIsAlias, so the cleanup
loop's deleteIndexWithBackoff doesn't replay the same 31s backoff.
Keep canonical in cleanup when it is concrete — that's where the
first-reindex flow drops the original concrete after the three-step
swap moves aliases off it.
Local repro: SearchIndexingApplication now completes in ~7s instead
of hanging past 300s. New unit test
testPromoteEntityIndexAtomicSwapWhenCanonicalIsAlias locks the new
shape (single swap, no delete-by-alias, old concrete cleaned up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Local repro on the CI-equivalent stack (mysql + elasticsearch + sample data + ingestion) completes in ~30s with all 60 entities going through the atomic-swap path. CI on the same commit still hangs past 300s.

Add a structured log line at the top of every promote() so the next CI run shows which entities reach promotion and what shape (atomic vs three-step) was selected — pinpoints whether a specific entity gets stuck or whether reindex never reaches promote(). No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… main

The original "live settings not restored on staged after promote" regression is fixed by PR #27930 (commit e56abb8), which is already on main: applyLiveServingSettings + maybeForceMerge run before the alias swap, and DistributedSearchIndexExecutor wires job configuration into the per-entity DefaultRecreateHandler. That minimal fix is sufficient for the original symptom (newly created entities not searchable until a manual _refresh).

This commit reverts every file we refactored further on top of that minimal fix back to origin/main:
- DefaultRecreateHandler.java — drop the deferred-canonical-delete three-step swap, post-state checks, dataLoss tracking, the safeIndexExists/checkIndexExists/checkAliasAttached helpers, the promoteByAtomicAliasSwap path, and the ALIAS_PROMOTE_BEGIN diagnostic
- RecreateIndexHandler.java — drop the getFailedPromotions / getDataLossPromotions defaults
- DistributedIndexingStrategy.java — drop the FAILED-status emit and dataLoss aggregation
- SearchIndexExecutor.java — drop the FAILED-status emit
- DistributedSearchIndexExecutor.java — drop @Getter on recreateIndexHandler
- The matching unit tests in DefaultRecreateHandlerTest / DistributedIndexingStrategyTest / SearchIndexExecutorControlFlowTest / DistributedSearchIndexExecutorTest all revert to origin/main.

The branch's only remaining contribution is the AppsResourceIT case-mismatch fix in waitForAppJobCompletion (a pre-existing bug discovered while diagnosing).

Reason: the further refactor has consistently failed CI on this branch since the first commit. Local repro on the CI-equivalent stack (./docker/run_local_docker.sh -d mysql -m no-ui -s true -i true) showed the canonical-is-alias fix working end-to-end (~30s, 60 entities), but CI still timed out at 300s. Without server-side logs from CI we can't target the remaining gap. PR #27930 already addresses the user-visible regression, so the safest move is to drop the further refactor and ship just the case-mismatch fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the first reindex, the canonical name (e.g. table_search_index) is an alias on the previous staged index, not a concrete index. OpenSearch's listIndicesByPrefix returns the alias name as one of its result keys, which then drives a deleteIndexWithBackoff(canonicalIndex) attempt that fails with "[illegal_argument_exception] matches an alias, specify the corresponding concrete indices instead". The 6-attempt exponential backoff burns ~31s per entity (1+2+4+8+16s); on a 60-entity reindex that wastes ~30 minutes in cleanup with search degraded throughout.

Drop the alias name from oldIndicesToDelete when getIndicesByAlias proves canonical is an alias right now. The atomic swapAliases call moves the canonical alias from the old concrete index to the new staged one in one step; the underlying old concrete index is already in oldIndicesToDelete and gets cleaned up normally by the post-swap loop. No three-step swap or deferred-canonical-delete restructure needed.

Adjusts testFinalizeReindexPromotesPartialData to use a realistic canonical-concrete setup (no self-alias on its own name — that state cannot exist in real OS/ES) so the new guard does not misfire on the existing test fixture. New unit test testFinalizeReindexSkipsDeleteWhenCanonicalIsAlias locks in the new behavior: when canonical is an alias, deleteIndexWithBackoff is never called with the canonical name, and the old concrete rebuild index is cleaned up via the swap path.

Verified locally on the CI-equivalent stack (./docker/run_local_docker.sh -d mysql -m no-ui -s true -i true): both the first reindex (canonical-concrete) and the second reindex (canonical-alias) complete in ~8s with no "matches an alias" errors and clean cleanup logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
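The cleanup guard, sketched. getIndicesByAlias and deleteIndexWithBackoff are the client calls named in this commit; the surrounding method and interface are illustrative.

```java
import java.util.Set;

/** Sketch of the cleanup guard; client methods are named as in the commit. */
class CanonicalAliasGuardSketch {
  interface SearchClient {
    Set<String> getIndicesByAlias(String alias); // empty when the name is not an alias
    void deleteIndexWithBackoff(String concreteIndex);
  }

  void cleanupOldIndices(
      SearchClient client, String canonicalIndex, Set<String> oldIndicesToDelete) {
    // listIndicesByPrefix can return the canonical ALIAS name among its keys;
    // deleting by an alias name fails on OS/ES and burns ~31s of backoff.
    if (!client.getIndicesByAlias(canonicalIndex).isEmpty()) {
      oldIndicesToDelete.remove(canonicalIndex); // alias, not a concrete index
    }
    for (String index : oldIndicesToDelete) {
      client.deleteIndexWithBackoff(index); // concrete *_rebuild_* indices only
    }
  }
}
```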
Code Review ✅ Approved · 4 resolved / 4 findings

Restores live settings on the per-entity promote path by correctly applying index settings and wiring job configurations, ensuring indices become searchable post-promotion. Addresses several issues including unhandled exceptions during promotion, missing cleanup of REST clients in tests, and incorrect alias swap logic.

✅ 4 resolved
✅ Edge Case: applyLiveServingSettings/maybeForceMerge exception skips unregisterStagedIndex
✅ Quality: Rest5Client created but never closed in integration test
✅ Edge Case: Post-state checks can throw unhandled, leaving promotion in limbo
✅ Bug: promoteByAtomicAliasSwap hardcodes reindexSuccess=true
🟡 Playwright Results — all passed (10 flaky)

✅ 4002 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 86 skipped

🟡 10 flaky test(s) (passed on retry)

How to debug locally:

```sh
# Download the playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip  # view trace
```
Summary
Reindex promotion succeeds but the swapped-in index keeps the bulk-build settings (`refresh_interval=-1`, `replicas=0`, `translog=async`), so live writes after promotion are buffered indefinitely and never become searchable until a manual `_refresh`. This surfaces in production as "indexing is fast, but searches/UIs return nothing on freshly created entities."

Two coupled bugs in the per-entity distributed promote path:
1. `DefaultRecreateHandler.promoteEntityIndex` was missing the live-settings restore. `finalizeReindex` calls `applyLiveServingSettings` + `maybeForceMerge` before the alias swap; the per-entity sibling jumped straight from the doc-count check to the alias swap. Mirror the call sequence.
2. `DistributedSearchIndexExecutor` constructed a second `RecreateIndexHandler` without wiring `jobData`. With `jobData=null`, `applyLiveServingSettings` resolves no overrides, `buildRevertJson` returns `null`, and the call silently no-ops — so even if the previous fix were in place, the per-entity path would still miss settings restoration. Pass `currentJob.getJobConfiguration()` into the handler at construction, mirroring what `DistributedIndexingStrategy` already does.

Provenance
These two source changes are cherry-picks of `b272de85f9` and `68dad610ed`, which currently sit unmerged on the `context_center` feature branch (PR #27558) — a Knowledge → Context Center rename PR that is unrelated to search indexing. Splitting them out so they can ship independently.

A regression test (`testPromoteEntityIndexAppliesLiveServingSettingsBeforeSwap`) is added to lock the behavior in: it builds a handler with bulk overrides on `EventPublisherJob`, calls `promoteEntityIndex`, and asserts `searchClient.updateIndexSettings` is invoked with `refresh_interval=1s`, `number_of_replicas=1`, `translog.durability=request` before the alias swap. Verified locally: the test passes with both source commits applied; reverting either one fails the assertion with "Wanted but not invoked".

Test plan
- `mvn test -pl openmetadata-service -Dtest='DefaultRecreateHandlerTest,DistributedSearchIndexExecutorTest,DistributedIndexingStrategyTest,SearchIndexExecutorControlFlowTest'` → 136 tests, 0 failures
- The regression test fails when `applyLiveServingSettings`/`maybeForceMerge` are removed from `promoteEntityIndex` (Mockito reports "Wanted but not invoked")
- Run a reindex with the bulk overrides (`replicas=0`, `refresh=-1`, `translog=async`) and confirm post-promotion `_settings` show `refresh_interval=1s`, not `-1`

🤖 Generated with Claude Code
Summary by Gitar
- Adds a cleanup guard in `DefaultRecreateHandler` to prevent unnecessary delete-by-alias-name operations on canonical indices.
- Adds `testFinalizeReindexSkipsDeleteWhenCanonicalIsAlias` to verify the cleanup guard and ensure correct alias migration to the staged index.

This will update automatically on new commits.