
Make SearchIndexing distributed-only#27971

Open
harshach wants to merge 14 commits into main from harshach/searchindex-distributed-default

Conversation

@harshach
Collaborator

@harshach harshach commented May 7, 2026

Describe your changes:

Fixes N/A

I made SearchIndexing always use distributed staged-index reindexing because the app should avoid live-index writes and no longer expose distributed/recreate mode choices. This removes the old single-server pipeline/classes and legacy config options, adds helper classes for entity type handling, stats mapping, config sanitization, and staged finalization, and updates generated schemas/docs/scripts.

Testing:

  • mvn -pl openmetadata-service spotless:apply -DskipTests
  • focused backend suite with 150 tests
  • UI schema Jest test
  • git diff --check
  • local Docker deployment with latest SearchIndexingApplication run success, 36/36 records indexed, 0 failures, and no legacy flags in app-run config

Type of change:

  • Improvement

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: runtime sanitization removes legacy SearchIndexing config keys from persisted app-run records, so a migration script is not needed.
  • I have added tests around the new logic.

Summary by Gitar

  • Bug fixes:
    • Prevented redundant column index promotion by tracking finalizedEntities during reindexing.
    • Added logical routing to ensure table and column reindexing finalization are coordinated.
  • Test coverage expansion:
    • Added DistributedReindexFinalizerTest to verify idempotent reindex finalization and correct handling of linked entity types.
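
The idempotent finalization described in this summary can be illustrated with a small sketch. It is not the PR's DistributedReindexFinalizer: the class name, the method names, and the "tableColumn" label for the linked column index are assumptions made for illustration. It only shows how a set of already-finalized entity types can prevent a second promotion.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the PR's DistributedReindexFinalizer.
final class IdempotentFinalizerSketch {
  // Entity types whose staged index has already been promoted in this run.
  private final Set<String> finalizedEntities = new LinkedHashSet<>();
  // Records each promotion in order so callers can observe what happened.
  private final List<String> promotions = new ArrayList<>();

  /** Promotes entityType at most once; returns true only if this call promoted it. */
  boolean finalizeEntity(String entityType) {
    // Set.add returns false when the type is already present, so repeats are no-ops.
    if (!finalizedEntities.add(entityType)) {
      return false;
    }
    promotions.add(entityType);
    // Assumption: the column index ("tableColumn" is an illustrative name)
    // is promoted together with the table index, and also only once.
    if ("table".equals(entityType) && finalizedEntities.add("tableColumn")) {
      promotions.add("tableColumn");
    }
    return true;
  }

  List<String> promotions() {
    return List.copyOf(promotions);
  }

  public static void main(String[] args) {
    IdempotentFinalizerSketch finalizer = new IdempotentFinalizerSketch();
    finalizer.finalizeEntity("table");
    finalizer.finalizeEntity("table"); // ignored: already finalized
    System.out.println(finalizer.promotions());
  }
}
```

Using the return value of Set.add as the guard keeps the check and the bookkeeping in one step, which is one simple way to avoid a redundant promotion.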

This will update automatically on new commits.

Copilot AI review requested due to automatic review settings May 7, 2026 16:22
@harshach harshach requested a review from a team as a code owner May 7, 2026 16:22
@github-actions github-actions Bot added the backend and safe to test labels May 7, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR makes the SearchIndexing application distributed-only and enforces staged index writes with alias promotion, removing legacy single-server classes and the recreateIndex / useDistributedIndexing configuration toggles across backend, UI schemas, docs, and scripts.

Changes:

  • Remove legacy mode flags (recreateIndex, useDistributedIndexing) from schemas/config handling and sanitize persisted app/run configs.
  • Delete single-server indexing pipeline/strategy code and related tests; route all reindexing through distributed staged-index flow.
  • Add helper utilities for entity-type normalization, distributed stats mapping, and staged finalization/promotion.
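
As a rough illustration of the first change, stripping the two legacy flags from a persisted config could look like the sketch below. This is an assumption-laden example, not the PR's SearchIndexAppConfigSanitizer: the class name, the Map-based config shape, and the "batchSize" key are hypothetical. Only the key names recreateIndex and useDistributedIndexing come from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the PR's SearchIndexAppConfigSanitizer may work differently.
final class LegacyConfigSanitizerSketch {
  // The two legacy mode flags named in the PR.
  static final Set<String> LEGACY_KEYS = Set.of("recreateIndex", "useDistributedIndexing");

  /** Returns a copy of config without the legacy keys; the input map is not modified. */
  static Map<String, Object> sanitize(Map<String, Object> config) {
    Map<String, Object> cleaned = new LinkedHashMap<>(config);
    cleaned.keySet().removeAll(LEGACY_KEYS);
    return cleaned;
  }

  public static void main(String[] args) {
    Map<String, Object> persisted = new LinkedHashMap<>();
    persisted.put("batchSize", 100); // illustrative key, not taken from the PR
    persisted.put("recreateIndex", true);
    persisted.put("useDistributedIndexing", false);
    System.out.println(sanitize(persisted));
  }
}
```

Sanitizing at read time, as the PR description states, means old persisted records stay readable without a migration script.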

Reviewed changes

Copilot reviewed 60 out of 70 changed files in this pull request and generated 2 comments.

File Description
openmetadata-ui/src/main/resources/ui/src/utils/ApplicationSchemas/SearchIndexingApplication.json Removes legacy mode toggles from UI schema and updates language field description.
openmetadata-ui/src/main/resources/ui/src/generated/metadataIngestion/workflow.ts Regenerates types to drop legacy indexing config field.
openmetadata-ui/src/main/resources/ui/src/generated/metadataIngestion/applicationPipeline.ts Regenerates types to drop legacy indexing config field.
openmetadata-ui/src/main/resources/ui/src/generated/metadataIngestion/application.ts Regenerates types to drop legacy indexing config field.
openmetadata-ui/src/main/resources/ui/src/generated/entity/services/ingestionPipelines/ingestionPipeline.ts Regenerates types to drop legacy indexing config field.
openmetadata-ui/src/main/resources/ui/src/generated/entity/applications/marketplace/createAppMarketPlaceDefinitionReq.ts Regenerates types to drop legacy indexing config field.
openmetadata-ui/src/main/resources/ui/src/generated/entity/applications/marketplace/appMarketPlaceDefinition.ts Regenerates types to drop legacy indexing config field.
openmetadata-ui/src/main/resources/ui/src/generated/entity/applications/configuration/internal/searchIndexingAppConfig.ts Regenerates SearchIndexing app config type to remove legacy options and update docs.
openmetadata-ui/src/main/resources/ui/src/generated/entity/applications/app.ts Regenerates types to drop legacy indexing config field.
openmetadata-ui/src/main/resources/ui/src/generated/api/services/ingestionPipelines/createIngestionPipeline.ts Regenerates API types to drop legacy indexing config field.
openmetadata-ui/src/main/resources/ui/public/locales/en-US/Applications/SearchIndexingApplication.md Removes legacy option docs and updates wording to staged promotion.
openmetadata-spec/src/main/resources/json/schema/entity/applications/configuration/internal/searchIndexingAppConfig.json Removes legacy mode options from the SearchIndexing app JSON schema.
openmetadata-service/src/test/java/org/openmetadata/service/cache/EntityCacheBypassTest.java Updates test docstring to reflect removal of single-server executor path.
openmetadata-service/src/test/java/org/openmetadata/service/apps/logging/AppRunLogAppenderTest.java Updates logger name expectation after executor/orchestrator refactor.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/SingleServerIndexingStrategyTest.java Deletes tests for removed single-server strategy.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexStatsTest.java Deletes tests tied to removed single-server stats/executor implementation.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexFailureScenarioTest.java Deletes tests tied to removed single-server failure scenarios.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexEndToEndTest.java Deletes end-to-end test targeting removed executor flow.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/ReindexingOrchestratorTest.java Updates orchestrator tests for distributed-only behavior + adds legacy-config sanitization assertion.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/QuartzOrchestratorContextTest.java Updates tests for new createReindexingContext() signature.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/QuartzJobContextTest.java Updates Quartz job context tests after removing distributed flag.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/listeners/SlackProgressListenerTest.java Updates Slack listener config details to staged promotion wording.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/listeners/QuartzProgressListenerTest.java Updates listener test config builder after removing legacy flags.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/listeners/LoggingProgressListenerTest.java Updates logging listener to report staged promotion rather than recreate/distributed toggles.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/IndexingPipelineTest.java Deletes tests for removed single-server pipeline.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/EntityReaderRetryTest.java Deletes tests for removed single-server reader.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/EntityReaderLifecycleTest.java Deletes tests for removed single-server reader lifecycle behavior.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/EntityBatchSizeEstimatorTest.java Deletes tests for removed batch size estimator.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/DistributedIndexingStrategyTest.java Updates distributed strategy tests for staged-index context and new executor signature.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/PartitionWorkerTest.java Updates tests for staged index context being mandatory and always writing staged.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutorTest.java Updates executor tests for required staged index context and promotion handler renames.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedJobParticipantTest.java Updates participant tests to include staged index mapping on jobs.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedJobContextTest.java Updates context tests after removing isDistributed().
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/CompositeProgressListenerTest.java Updates test context stub after removing isDistributed().
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/AdaptiveBackoffTest.java Deletes tests for removed adaptive backoff.
openmetadata-service/src/main/resources/json/data/appMarketPlaceDefinition/SearchIndexingApplication.json Updates marketplace definition wording and removes legacy default config flag.
openmetadata-service/src/main/resources/json/data/app/SearchIndexingApplication.json Removes legacy default config flag from installed app config.
openmetadata-service/src/main/java/org/openmetadata/service/workflows/searchIndex/ReindexingUtil.java Updates import to share time-series entity set via new helper.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/SingleServerIndexingStrategy.java Deletes removed single-server strategy implementation.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexEntityTypes.java Adds centralized entity-type constants and normalization helpers.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexAppConfigSanitizer.java Adds runtime config sanitization to strip removed legacy keys.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexApp.java Sanitizes app config on init/validation and unifies distributed job status handling.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ReindexingProgressListener.java Updates callback documentation to staged-index terminology.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ReindexingOrchestrator.java Refactors orchestration to always run distributed strategy and sanitize configs.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ReindexingJobContext.java Removes isDistributed() from job context API.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ReindexingConfiguration.java Removes legacy mode flags from runtime configuration model and builders.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/QuartzOrchestratorContext.java Updates context factory signature for distributed-only mode.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/QuartzJobContext.java Removes distributed flag from Quartz job context.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/OrphanedIndexCleaner.java Updates comments to reflect staged reindexing behavior.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/OrchestratorContext.java Updates orchestrator context interface for new job context factory signature.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/listeners/SlackProgressListener.java Reports staged-promotion mode instead of recreate/distributed toggles.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/listeners/LoggingProgressListener.java Reports staged-promotion mode and removes distributed-mode logging.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/IndexingStrategy.java Deletes obsolete strategy abstraction.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/EntityReader.java Deletes removed single-server reader.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/EntityBatchSizeEstimator.java Deletes removed batch sizing helper.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/DistributedReindexStatsMapper.java Extracts distributed job→Stats mapping logic into a dedicated helper.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/DistributedReindexFinalizer.java Extracts staged index finalization/promotion logic into a dedicated helper.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/DistributedIndexingStrategy.java Enforces staged index preparation and simplified distributed executor invocation.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/PartitionWorker.java Requires staged index context and always writes to staged target indexes.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/PartitionCalculator.java Switches time-series detection to shared helper and removes duplicated constants.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java Makes staged index context mandatory and standardizes promotion handler naming.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexCoordinator.java Updates precompute logic to use shared time-series classification helper.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedJobParticipant.java Requires staged index mapping for participation and reconstructs staged context from it.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedJobContext.java Removes isDistributed() from distributed job context.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DISTRIBUTED_INDEXING.md Updates documentation to reflect distributed-only staged promotion and removes legacy flags.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/AdaptiveBackoff.java Deletes removed backoff utility.
bin/distributed-test/scripts/trigger-reindex.sh Removes legacy flags and updates request payload to distributed-only mode.
Comments suppressed due to low confidence (1)

openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/PartitionCalculator.java:252

  • SearchIndexEntityTypes.isTimeSeriesEntity(entityType) normalizes legacy queryCostResult to queryCostRecord, but getTimeSeriesEntityCount() still uses the unnormalized entityType for Entity.getEntityTimeSeriesRepository(entityType) and reindexConfig.getTimeSeriesStartTs(entityType). If a legacy config/job includes queryCostResult, this path will throw EntityNotFoundException and silently return 0 due to the outer catch.

Normalize entityType once (e.g., String normalized = SearchIndexEntityTypes.normalizeEntityType(entityType)) and use the normalized value consistently for repo lookups, time-window lookups, and hashing/logging.

  public long getEntityCount(String entityType, ReindexingConfiguration reindexConfig) {
    try {
      long count;
      if (SearchIndexEntityTypes.isTimeSeriesEntity(entityType)) {
        count = getTimeSeriesEntityCount(entityType, reindexConfig);
      } else {
        count = getRegularEntityCount(entityType);
      }

@github-actions
Contributor

github-actions Bot commented May 7, 2026

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

Copilot AI review requested due to automatic review settings May 7, 2026 16:33
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 61 out of 71 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/PartitionCalculator.java:275

  • SearchIndexEntityTypes.isTimeSeriesEntity() normalizes legacy names (e.g. queryCostResult → queryCostRecord), but PartitionCalculator.getEntityCount() continues using the original entityType for repository lookups. This can lead to a type being classified as time-series yet still throwing EntityNotFoundException in Entity.getEntityTimeSeriesRepository(entityType) (caught and returning 0), silently skipping partitions/counts. Normalize entityType once at the start of getEntityCount()/getTimeSeriesEntityCount() (and use the normalized value consistently for filters/repository lookup).
  public long getEntityCount(String entityType, ReindexingConfiguration reindexConfig) {
    try {
      long count;
      if (SearchIndexEntityTypes.isTimeSeriesEntity(entityType)) {
        count = getTimeSeriesEntityCount(entityType, reindexConfig);
      } else {
        count = getRegularEntityCount(entityType);
      }
      LOG.debug("Entity count for {}: {}", entityType, count);
      return count;
    } catch (Exception e) {
      LOG.error("Failed to get entity count for type: {} - returning 0", entityType, e);
      return 0;
    }
  }

  private long getRegularEntityCount(String entityType) {
    EntityRepository<?> repository = Entity.getEntityRepository(entityType);
    return repository.getDao().listCount(new ListFilter(Include.ALL));
  }

  private long getTimeSeriesEntityCount(String entityType, ReindexingConfiguration reindexConfig) {
    ListFilter listFilter = new ListFilter(Include.ALL);
    EntityTimeSeriesRepository<?> repository;

    if (SearchIndexEntityTypes.isDataInsightEntity(entityType)) {
      listFilter.addQueryParam("entityFQNHash", FullyQualifiedName.buildHash(entityType));
      repository = Entity.getEntityTimeSeriesRepository(Entity.ENTITY_REPORT_DATA);
    } else {
      repository = Entity.getEntityTimeSeriesRepository(entityType);
    }
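
The "normalize once" suggestion can be sketched in isolation as follows. This is not the PR's PartitionCalculator or SearchIndexEntityTypes: the class, the in-memory count map standing in for a repository, and the method bodies are assumptions. Only the queryCostResult to queryCostRecord mapping comes from the review comment.

```java
import java.util.Map;

// Hypothetical sketch; not the PR's SearchIndexEntityTypes or PartitionCalculator.
final class EntityTypeNormalizationSketch {
  // Legacy name -> current name. The single pair comes from the review comment.
  private static final Map<String, String> LEGACY_NAMES =
      Map.of("queryCostResult", "queryCostRecord");

  static String normalizeEntityType(String entityType) {
    return LEGACY_NAMES.getOrDefault(entityType, entityType);
  }

  /** Stand-in for a repository lookup that only knows current entity type names. */
  static long countFor(String normalizedType, Map<String, Long> countsByType) {
    Long count = countsByType.get(normalizedType);
    if (count == null) {
      throw new IllegalArgumentException("Unknown entity type: " + normalizedType);
    }
    return count;
  }

  static long getEntityCount(String entityType, Map<String, Long> countsByType) {
    // Normalize once at the boundary, then use only the normalized value below.
    String normalized = normalizeEntityType(entityType);
    return countFor(normalized, countsByType);
  }

  public static void main(String[] args) {
    Map<String, Long> counts = Map.of("queryCostRecord", 7L);
    // The legacy name resolves to the same count as the current name.
    System.out.println(getEntityCount("queryCostResult", counts));
  }
}
```

With the normalized value threaded through every lookup, a legacy name no longer reaches a lookup that would reject it and be swallowed by an outer catch.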

Copilot AI review requested due to automatic review settings May 7, 2026 16:45
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 62 out of 72 changed files in this pull request and generated 2 comments.

@github-actions
Contributor

github-actions Bot commented May 7, 2026

Jest test Coverage

UI tests summary

Lines: 62% (coverage badge)
Statements: 62.32% (64271/103115)
Branches: 42.92% (34841/81166)
Functions: 45.73% (10258/22430)

@sonarqubecloud

sonarqubecloud Bot commented May 9, 2026

Copilot AI review requested due to automatic review settings May 10, 2026 17:12
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 73 out of 83 changed files in this pull request and generated 2 comments.

Comment on lines +84 to +88
      finalizeEntityReindex(entityType, entitySuccess);
      if (Entity.TABLE.equals(entityType)) {
        promoteColumnIndex(entitySuccess);
      }
    } catch (Exception ex) {

  public static DistributedJobNotifier create(CollectionDAO collectionDAO, String serverId) {
    LOG.info(
-       "Redis not configured - using database polling for distributed job notifications (30s discovery delay)");
+       "Using database polling for distributed search indexing job discovery (2s discovery interval)");
@gitar-bot

gitar-bot Bot commented May 10, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Transitions SearchIndexing to a strictly distributed, staged-index architecture by removing legacy pipelines and config options. Logic now includes improved column index promotion and robust coordination for entity reindexing.

✅ 3 resolved
Edge Case: computeEntitySuccess returns true when stats are missing for entity

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/DistributedReindexFinalizer.java:114-117
In DistributedReindexFinalizer.computeEntitySuccess, when entityStats has entries but none for the requested entityType, the method returns true (line 116). This entity reached the finalizer because it was NOT already promoted by the per-entity completion tracker — meaning its partitions did not all complete successfully. Returning true here causes finalizeEntityReindex to promote the staged index with success=true, potentially swapping in an incomplete index that is missing data.

The intent appears to be "if there are no stats, we can't determine failure, so assume success" but the safer default for index promotion is to assume failure and leave the old index in place.
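
The safer default argued for in this finding can be sketched as below. This is not the code in DistributedReindexFinalizer: the class name and the representation of stats as a map from entity type to failed-record count are assumptions chosen to keep the example small.

```java
import java.util.Map;

// Hypothetical sketch of a conservative success check; not the PR's implementation.
final class EntitySuccessSketch {
  /**
   * Decides whether a staged index may be promoted.
   * failedByType maps an entity type to its failed-record count (an assumed shape).
   */
  static boolean computeEntitySuccess(String entityType, Map<String, Long> failedByType) {
    Long failed = failedByType.get(entityType);
    // Missing stats mean the outcome is unknown, so keep the old index in place.
    if (failed == null) {
      return false;
    }
    return failed == 0L;
  }

  public static void main(String[] args) {
    Map<String, Long> stats = Map.of("table", 0L, "dashboard", 3L);
    System.out.println(computeEntitySuccess("table", stats));     // has stats, no failures
    System.out.println(computeEntitySuccess("dashboard", stats)); // has failures
    System.out.println(computeEntitySuccess("topic", stats));     // no stats recorded
  }
}
```

Treating "no stats" as failure trades an occasional unnecessary retry for never promoting an index whose completeness is unknown.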

Quality: Magic string "all" instead of SearchIndexEntityTypes.ALL

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ReindexingConfiguration.java:184
The isSmartReindexing() method in ReindexingConfiguration uses the hardcoded string literal "all" (line 184) instead of the newly introduced SearchIndexEntityTypes.ALL constant. The whole purpose of this PR was to centralize entity type handling into SearchIndexEntityTypes, yet this reference was missed.

Quality: Unused cacheConfig parameter in factory method

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedJobNotifierFactory.java:38-39
The DistributedJobNotifierFactory.create() method still accepts a CacheConfig parameter that is never used — the method unconditionally returns a PollingJobNotifier. This is dead code that misleads readers into thinking cache configuration influences notifier selection. Since Redis-based notification has been removed, the parameter should be removed from the factory (and its callers updated).


