fix(search): align dataAsset aggregation counts with index=tableColumn totals#27846

Open
mohityadav766 wants to merge 12 commits into main from fix-column-agg

Conversation

@mohityadav766
Member

@mohityadav766 mohityadav766 commented Apr 30, 2026

Summary

  • Route the dataAsset/all alias through a new buildAllAssetsSearchBuilderV2, which builds a per-entity-type bool union: each clause is filter(entityType=<type>) must(<type's own query>). The tableColumn branch reuses the column builder; every other type goes through its dedicated asset config; a fallback should clause covers types in the dataAsset alias that have no config (e.g. glossary, apiCollection). By construction, each entity-type bucket count therefore equals what the dedicated index returns for the same query.
  • Tighten buildColumnSearchBuilderV2 to Operator.And so multi-token queries like first_name require every analyzer sub-token to match somewhere (the previous Or + min_should_match=0 over-matched on any single sub-token).
  • Add name.ngram, name.compound, displayName.ngram, displayName.compound to ColumnSearchIndex.getFields() so prefix queries (e.g. fir) still match column docs from both index=tableColumn and the dataAsset bucket.
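
The shape of the per-entity-type union described above can be sketched with plain Java maps. This is an illustration only: it does not use the project's `ElasticQueryBuilder` / `OpenSearchQueryBuilder` helpers, the class and method names (`PerTypeUnionSketch`, `typeClause`, `fallbackClause`, `union`) are hypothetical, and the per-type queries passed in are placeholders for whatever each type's own builder produces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: models the query shape as nested maps, not with the real client builders. */
public class PerTypeUnionSketch {

  /** One clause per configured type: filter on entityType, must on that type's own query. */
  public static Map<String, Object> typeClause(String entityType, Map<String, Object> typeQuery) {
    Map<String, Object> bool = new LinkedHashMap<>();
    bool.put("filter", List.of(Map.of("term", Map.of("entityType", entityType))));
    bool.put("must", List.of(typeQuery));
    return Map.of("bool", bool);
  }

  /** Fallback clause: a default query for every type that has no dedicated configuration. */
  public static Map<String, Object> fallbackClause(
      List<String> configuredTypes, Map<String, Object> defaultQuery) {
    Map<String, Object> bool = new LinkedHashMap<>();
    bool.put("must", List.of(defaultQuery));
    if (!configuredTypes.isEmpty()) {
      // Exclude every type that already has its own clause.
      bool.put("must_not", List.of(Map.of("terms", Map.of("entityType", configuredTypes))));
    }
    return Map.of("bool", bool);
  }

  /** Union of all per-type clauses plus the fallback; at least one clause must match. */
  public static Map<String, Object> union(
      Map<String, Map<String, Object>> perTypeQueries, Map<String, Object> defaultQuery) {
    List<Object> should = new ArrayList<>();
    perTypeQueries.forEach((type, query) -> should.add(typeClause(type, query)));
    should.add(fallbackClause(new ArrayList<>(perTypeQueries.keySet()), defaultQuery));
    Map<String, Object> bool = new LinkedHashMap<>();
    bool.put("should", should);
    bool.put("minimum_should_match", 1);
    return Map.of("bool", bool);
  }
}
```

Because a document has exactly one entityType, it can satisfy at most one clause of the union, which is why each bucket would be expected to match the count from the corresponding dedicated query.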

Issue: open-metadata/openmetadata-collate#3851

Reproduction / verification

scripts/reproduce_column_agg_mismatch.sh is a multi-query probe that exits non-zero on any divergence between the dataAsset aggregation bucket and the index=tableColumn total. After this PR, all probed queries return matching counts:

[OK] q='first_name'           agg.tableColumnBucket=85   tableColumn.total=85
[OK] q='last_name'            agg.tableColumnBucket=45   tableColumn.total=45
[OK] q='first name'           agg.tableColumnBucket=85   tableColumn.total=85
[OK] q='shipping address'     agg.tableColumnBucket=0    tableColumn.total=0
[OK] q='first name address'   agg.tableColumnBucket=0    tableColumn.total=0
[OK] q='fir'                  agg.tableColumnBucket=45   tableColumn.total=45
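
The parity check behind this output can be sketched in Java as follows. This is not the script itself: `ParityProbeSketch` and its methods are hypothetical names, the `[MISMATCH]` label is an assumption (the script's failure output is not shown here), and the counts are passed in directly rather than fetched from the search API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parity probe: compares a dataAsset bucket count with a dedicated-index total. */
public class ParityProbeSketch {

  /** Formats one probe line in a style similar to the output above. */
  public static String report(String query, long aggBucket, long indexTotal) {
    String status = aggBucket == indexTotal ? "[OK]" : "[MISMATCH]"; // failure label is assumed
    return String.format(
        "%s q='%s' agg.tableColumnBucket=%d tableColumn.total=%d",
        status, query, aggBucket, indexTotal);
  }

  /** Returns 0 when every probed query is in parity and 1 otherwise, like a non-zero exit. */
  public static int exitCode(Map<String, long[]> countsByQuery) {
    int code = 0;
    for (Map.Entry<String, long[]> entry : countsByQuery.entrySet()) {
      long[] counts = entry.getValue(); // counts[0] = bucket count, counts[1] = index total
      System.out.println(report(entry.getKey(), counts[0], counts[1]));
      if (counts[0] != counts[1]) {
        code = 1;
      }
    }
    return code;
  }

  public static void main(String[] args) {
    Map<String, long[]> probes = new LinkedHashMap<>();
    probes.put("first_name", new long[] {85, 85});
    probes.put("fir", new long[] {45, 45});
    System.exit(exitCode(probes));
  }
}
```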

Tests covering bucket-parity for tableColumn/table and the sub-token over-match guard are coming in a follow-up commit.

Test plan

  • Restart OM, run ./scripts/reproduce_column_agg_mismatch.sh — all probed queries return OK.
  • Spot-check the explore search bar in the UI: type first_name, confirm the tableColumn tab badge equals the entity-type aggregation panel count.
  • Type fir — column results appear, and the bucket count matches the tableColumn tab.
  • Confirm specific-index queries (index=table, index=topic, etc.) still behave as before.

🤖 Generated with Claude Code


Summary by Gitar

  • Search Performance Optimization:
    • Optimized buildUnconfiguredAssetFallbackV2 in both ElasticSearchSourceBuilderFactory and OpenSearchSourceBuilderFactory by replacing iterative mustNot term queries with a single, more efficient termsQuery.
    • Added utility methods termsQuery to ElasticQueryBuilder and OpenSearchQueryBuilder to support batch filtering of entity types.

This will update automatically on new commits.

…n totals

Refactor the `dataAsset`/`all` alias query path so each entity-type bucket in
the aggregation matches what its dedicated index returns for the same query.
The composite asset config used to merge fields from every type, then apply
phrase/ngram-fuzzy semantics to all docs; column docs got semantics different
from `buildColumnSearchBuilderV2`, which is why the explore search bar's two
calls disagreed on the tableColumn count.

The new `buildAllAssetsSearchBuilderV2` builds a per-entity-type bool union:
each clause is `filter(entityType=<type>) must(<type's own query>)`. The
column branch reuses `buildColumnMultiMatchV2`; every other type goes through
`buildBaseQueryV2` with its dedicated config; an extra `should` covers asset
types in the `dataAsset` alias that lack a config (e.g. `glossary`,
`apiCollection`) using the default config.

Also tightens the column builder to `Operator.And` so multi-token queries
like `first_name` require every sub-token to match somewhere — fixes the
`om_analyzer`-driven over-match where the lenient `Or` + `min_should_match=0`
variant matched any column whose name contained just `first` or just `name`.
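
A simplified model of the difference between the two operators, assuming the analyzer lowercases and splits on non-alphanumeric characters (the real `om_analyzer` and the multi-field `multi_match` behavior may differ; `OperatorSketch` and its methods are hypothetical names):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative single-field model of Or vs And matching over analyzer sub-tokens. */
public class OperatorSketch {

  /** Assumed tokenization: lowercase, then split on any non-alphanumeric character. */
  public static List<String> tokens(String text) {
    return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
        .filter(token -> !token.isEmpty())
        .collect(Collectors.toList());
  }

  /** Or semantics: a single shared sub-token is enough to match. */
  public static boolean matchesOr(String query, String columnName) {
    Set<String> docTokens = new HashSet<>(tokens(columnName));
    return tokens(query).stream().anyMatch(docTokens::contains);
  }

  /** And semantics: every query sub-token must be present in the column name. */
  public static boolean matchesAnd(String query, String columnName) {
    Set<String> docTokens = new HashSet<>(tokens(columnName));
    return docTokens.containsAll(tokens(query));
  }
}
```

Under this model, `matchesOr("first_name", "first_id")` is true because both contain `first`, while `matchesAnd("first_name", "first_id")` is false because `name` is missing.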

`ColumnSearchIndex.getFields()` gains `name.ngram`, `name.compound`,
`displayName.ngram`, `displayName.compound` so prefix queries like `fir`
still match column docs from both `index=tableColumn` and the dataAsset
bucket.
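
Why an n-gram subfield keeps a prefix such as `fir` matching can be sketched as below. The gram sizes (3 to 10) and the whole-name gram generation are assumptions made for illustration, not the actual settings of `name.ngram`; `NgramSketch` is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative model of partial matching through index-time n-grams. */
public class NgramSketch {

  /** All substrings of the token whose length is between minGram and maxGram. */
  public static Set<String> ngrams(String token, int minGram, int maxGram) {
    Set<String> grams = new HashSet<>();
    for (int start = 0; start < token.length(); start++) {
      for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
        grams.add(token.substring(start, start + len));
      }
    }
    return grams;
  }

  /** A query matches when it equals one of the grams produced for the column name. */
  public static boolean matchesPartial(String query, String columnName) {
    Set<String> grams = ngrams(columnName.toLowerCase(Locale.ROOT), 3, 10); // assumed sizes
    return grams.contains(query.toLowerCase(Locale.ROOT));
  }
}
```

With exact token matching alone, `fir` is not equal to `first` or `name`, so a strict And query would return nothing; the gram set for `first_name` does contain `fir`, which is the behavior the added subfields are meant to preserve.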

Includes `scripts/reproduce_column_agg_mismatch.sh` — multi-query probe that
exits non-zero on any divergence between `index=dataAsset` (aggregation
bucket) and `index=tableColumn` (total) for the same query.

Issue: open-metadata/openmetadata-collate#3851

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 11:01
@github-actions github-actions bot added the backend and safe to test labels Apr 30, 2026
Adds integration tests under ColumnSearchIndexIT that pin the behavior of the
fix in PR #27846:

- testDataAssetTableColumnAggregationMatchesTableColumnTotal: dataAsset
  bucket count for tableColumn equals index=tableColumn total for a
  multi-token query against seeded columns.
- testColumnQueryRequiresAllSubtokensToMatch: query "<tag>_first_name" must
  match the seeded "<tag>_first_name" column but NOT "<tag>_first_id" — pins
  the Operator.And fix that closed the om_analyzer sub-token over-match.
- testDataAssetTableBucketMatchesTableIndexTotal: same parity guarantee for
  the "table" entity-type bucket, exercising the per-type-union path for a
  non-column type.
- testPrefixQueryMatchesViaNgramOnBothPaths: short prefix queries (e.g. the
  first few chars of the seeded tag) must match seeded columns via
  name.ngram and stay in parity across both endpoints.
- testUnconfiguredAssetTypeFallbackMatchesViaDataAsset: a Glossary doc (an
  asset type without an explicit searchSettings.assetTypeConfigurations
  entry) must still surface via index=dataAsset, exercising the fallback
  should clause in buildPerTypeUnionQueryV2.

Issue: open-metadata/openmetadata-collate#3851

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the search query-building logic for the all / dataAsset aliases so that entity-type aggregation bucket counts (notably tableColumn) align with the totals returned by the corresponding dedicated index queries, and tightens column query matching to avoid underscore sub-token overmatching.

Changes:

  • Route index=all and index=dataAsset through a new per-entity-type union query builder (ElasticSearch + OpenSearch) to align aggregation bucket counts with dedicated index totals.
  • Tighten tableColumn query behavior by reusing a shared column multi-match builder with Operator.And.
  • Expand column searchable fields to include name.ngram, name.compound, displayName.ngram, and displayName.compound to preserve prefix/partial matching behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| scripts/reproduce_column_agg_mismatch.sh | Adds a probe script to reproduce/verify the dataAsset aggregation vs tableColumn totals mismatch. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSourceBuilderFactory.java | Implements the per-entity-type union query for all/dataAsset and centralizes the stricter column multi-match. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSourceBuilderFactory.java | Mirrors the per-entity-type union query approach for ElasticSearch and centralizes the stricter column multi-match. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/ColumnSearchIndex.java | Adds ngram/compound subfields to the column field boost map to support prefix/partial matching. |

Comment on lines +392 to +394
* field. Without {@code And}, a query like {@code first_name} matches any column whose name
* contains just {@code first} or just {@code name}, which both inflates the column index hits
* and creates the dataAsset/tableColumn count mismatch tracked in github issue #3851.
Copilot AI Apr 30, 2026

The Javadoc references the old behavior as Operator.Or + min_should_match=0, but the helper used here takes fuzziness as the last parameter and only sets minimum_should_match when fuzziness is enabled and operator is OR. Please adjust the comment to match the real previous query shape to avoid confusion.

Suggested change
- * field. Without {@code And}, a query like {@code first_name} matches any column whose name
- * contains just {@code first} or just {@code name}, which both inflates the column index hits
- * and creates the dataAsset/tableColumn count mismatch tracked in github issue #3851.
+ * field. Previously this query used the same helper with {@code Operator.Or} and a last
+ * argument of {@code "0"}; with that OR-style query, a search like {@code first_name} matched
+ * any column whose name contained just {@code first} or just {@code name}, which inflated the
+ * column index hits and created the dataAsset/tableColumn count mismatch tracked in github
+ * issue #3851.

set -euo pipefail

HOST="${OM_HOST:-http://localhost:8585}"
TOKEN="${OM_TOKEN:-eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg}"
Copilot AI Apr 30, 2026

The script hard-codes a full JWT as the default OM_TOKEN value. Even if it’s intended for local dev, committing tokens is a security risk and encourages running with a privileged credential. Please remove the embedded token and instead require OM_TOKEN to be provided (or fail fast with a clear message / optionally prompt).

Suggested change
TOKEN="${OM_TOKEN:-eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg}"
TOKEN="${OM_TOKEN:?ERROR: OM_TOKEN must be set to a valid OpenMetadata JWT before running this script.}"

Comment on lines +455 to +503
  /**
   * Build a search source for the {@code all} / {@code dataAsset} alias as a per-entity-type
   * union: each asset type contributes a clause built with its own configuration (column docs go
   * through {@link #buildColumnMultiMatchV2(String)}, every other type through {@link
   * #buildBaseQueryV2(String, AssetTypeConfiguration)}), filtered by {@code entityType=<type>}.
   * Each entity-type bucket in the aggregation therefore equals what the dedicated index returns
   * for the same query, by construction. Avoids the composite-config divergence behind
   * github.com/open-metadata/openmetadata-collate#3851.
   */
  public ElasticSearchRequestBuilder buildAllAssetsSearchBuilderV2(
      String query, int from, int size, boolean explain, boolean includeAggregations) {
    AssetTypeConfiguration compositeConfig = buildCompositeAssetConfig(searchSettings);
    es.co.elastic.clients.elasticsearch._types.query_dsl.Query baseQuery =
        buildPerTypeUnionQueryV2(query);
    es.co.elastic.clients.elasticsearch._types.query_dsl.Query finalQuery =
        applyFunctionScoringV2(baseQuery, compositeConfig);
    es.co.elastic.clients.elasticsearch.core.search.Highlight highlightBuilder =
        buildHighlightingIfNeededV2(query, compositeConfig);

    ElasticSearchRequestBuilder searchRequestBuilder =
        createSearchSourceBuilderV2(finalQuery, from, size);
    if (highlightBuilder != null) {
      searchRequestBuilder.highlighter(highlightBuilder);
    }
    if (includeAggregations) {
      addConfiguredAggregationsV2(searchRequestBuilder, compositeConfig);
    }
    searchRequestBuilder.explain(explain);
    return searchRequestBuilder;
  }

  private es.co.elastic.clients.elasticsearch._types.query_dsl.Query buildPerTypeUnionQueryV2(
      String query) {
    if (isMatchAllQuery(query)) {
      return ElasticQueryBuilder.boolQuery().must(ElasticQueryBuilder.matchAllQuery()).build();
    }
    ElasticQueryBuilder.BoolQueryBuilder union = ElasticQueryBuilder.boolQuery();
    Set<String> configuredTypes = new HashSet<>();
    for (AssetTypeConfiguration typeConfig : searchSettings.getAssetTypeConfigurations()) {
      String assetType = typeConfig.getAssetType();
      if (assetType == null || assetType.equals(INDEX_ALL)) {
        continue;
      }
      configuredTypes.add(assetType);
      union.should(buildAssetTypeClauseV2(query, assetType, typeConfig));
    }
    union.should(buildUnconfiguredAssetFallbackV2(query, configuredTypes));
    union.minimumShouldMatch(1);
    return union.build();
Copilot AI Apr 30, 2026

This refactor changes the query semantics for the high-traffic all/dataAsset alias (per-entity-type bool union) and tightens column matching, but there are no automated tests added here to lock in the new bucket-parity and underscore sub-token behavior. Please add/extend tests (unit and/or integration) that assert: (1) index=dataAsset entityType bucket counts match the totals from the corresponding dedicated index for at least table and tableColumn, and (2) a query like first_name does not match columns that only contain first or only name.

Comment on lines +470 to +554
  /**
   * Build a search source for the {@code all} / {@code dataAsset} alias as a per-entity-type
   * union: each asset type contributes a clause built with its own configuration (column docs go
   * through {@link #buildColumnMultiMatchV2(String)}, every other type through {@link
   * #buildBaseQueryV2(String, AssetTypeConfiguration)}), filtered by {@code entityType=<type>}.
   * Each entity-type bucket in the aggregation therefore equals what the dedicated index returns
   * for the same query, by construction. Avoids the composite-config divergence behind
   * github.com/open-metadata/openmetadata-collate#3851.
   */
  public OpenSearchRequestBuilder buildAllAssetsSearchBuilderV2(
      String query, int from, int size, boolean explain, boolean includeAggregations) {
    AssetTypeConfiguration compositeConfig = getOrBuildCompositeConfig();
    os.org.opensearch.client.opensearch._types.query_dsl.Query baseQuery =
        buildPerTypeUnionQueryV2(query);
    os.org.opensearch.client.opensearch._types.query_dsl.Query finalQuery =
        applyFunctionScoringV2(baseQuery, compositeConfig);
    os.org.opensearch.client.opensearch.core.search.Highlight highlightBuilder =
        buildHighlightingIfNeededV2(query, compositeConfig);

    OpenSearchRequestBuilder searchRequestBuilder =
        createSearchSourceBuilderV2(finalQuery, from, size);
    if (highlightBuilder != null) {
      searchRequestBuilder.highlighter(highlightBuilder);
    }
    if (includeAggregations) {
      addConfiguredAggregationsV2(searchRequestBuilder, compositeConfig);
    }
    searchRequestBuilder.explain(explain);
    return searchRequestBuilder;
  }

  private os.org.opensearch.client.opensearch._types.query_dsl.Query buildPerTypeUnionQueryV2(
      String query) {
    if (isMatchAllQuery(query)) {
      return OpenSearchQueryBuilder.boolQuery()
          .must(OpenSearchQueryBuilder.matchAllQuery())
          .build();
    }
    OpenSearchQueryBuilder.BoolQueryBuilder union = OpenSearchQueryBuilder.boolQuery();
    Set<String> configuredTypes = new HashSet<>();
    for (AssetTypeConfiguration typeConfig : searchSettings.getAssetTypeConfigurations()) {
      String assetType = typeConfig.getAssetType();
      if (assetType == null || assetType.equals(INDEX_ALL)) {
        continue;
      }
      configuredTypes.add(assetType);
      union.should(buildAssetTypeClauseV2(query, assetType, typeConfig));
    }
    union.should(buildUnconfiguredAssetFallbackV2(query, configuredTypes));
    union.minimumShouldMatch(1);
    return union.build();
  }

  private static boolean isMatchAllQuery(String query) {
    return query == null || query.trim().isEmpty() || query.trim().equals("*");
  }

  private os.org.opensearch.client.opensearch._types.query_dsl.Query buildAssetTypeClauseV2(
      String query, String assetType, AssetTypeConfiguration typeConfig) {
    os.org.opensearch.client.opensearch._types.query_dsl.Query inner =
        Entity.TABLE_COLUMN.equals(assetType)
            ? buildColumnMultiMatchV2(query)
            : buildBaseQueryV2(query, typeConfig);
    return OpenSearchQueryBuilder.boolQuery()
        .filter(OpenSearchQueryBuilder.termQuery(ENTITY_TYPE_FIELD, assetType))
        .must(inner)
        .build();
  }

  /**
   * Catches asset types that are part of the {@code dataAsset} alias but lack a dedicated entry in
   * {@code searchSettings.assetTypeConfigurations} (e.g. {@code glossary}, {@code apiCollection}).
   * Without this, docs of those types would silently disappear from the dataAsset alias after the
   * per-type-union refactor.
   */
  private os.org.opensearch.client.opensearch._types.query_dsl.Query
      buildUnconfiguredAssetFallbackV2(String query, Set<String> configuredTypes) {
    OpenSearchQueryBuilder.BoolQueryBuilder fallback =
        OpenSearchQueryBuilder.boolQuery()
            .must(buildBaseQueryV2(query, getOrCreateDefaultConfig()));
    for (String configured : configuredTypes) {
      fallback.mustNot(OpenSearchQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured));
    }
    return fallback.build();
  }
Copilot AI Apr 30, 2026

The new all/dataAsset per-type-union behavior and the stricter column multi-match are not covered by tests in this PR. Please add/extend tests to ensure (a) dataAsset entityType bucket counts stay in sync with dedicated-index totals (especially tableColumn), and (b) underscore-split identifier queries (e.g. first_name) require all sub-tokens to match (no overmatching on just one token).

* by {@code om_analyzer} must hit some field. Without {@code And}, a query like {@code
* first_name} matches any column whose name contains just {@code first} or just {@code name},
* which both inflates the column index hits and creates the dataAsset/tableColumn count
* mismatch tracked in github issue #3851.
Copilot AI Apr 30, 2026

The Javadoc mentions the previous column builder used min_should_match=0, but the underlying multiMatchQuery(..., tieBreaker, fuzziness) helper’s last argument is fuzziness (and no minimum_should_match is set when fuzziness is "0"). Please update the comment to reflect the actual previous behavior so it doesn’t mislead future debugging/tuning.

Suggested change
- * mismatch tracked in github issue #3851.
+ * mismatch tracked in github issue #3851. The previous builder behavior here was equivalent to
+ * passing {@code fuzziness="0"} to {@code multiMatchQuery(..., tieBreaker, fuzziness)}, which
+ * disables fuzziness; this helper invocation does not set {@code minimum_should_match}.

Comment on lines +764 to +778
  private Table createTableWithMultiTokenColumns(TestNamespace ns, String baseName, String tag) {
    String shortId = ns.shortPrefix();

    org.openmetadata.schema.services.connections.database.PostgresConnection conn =
        DatabaseServices.postgresConnection().hostPort("localhost:5432").username("test").build();

    DatabaseService dbService =
        DatabaseServices.builder()
            .name("agg_svc_" + shortId + "_" + baseName)
            .connection(conn)
            .description("Test service for dataAsset/tableColumn aggregation parity")
            .create();

    CreateDatabase dbReq = new CreateDatabase();
    dbReq.setName("agg_db_" + shortId + "_" + baseName);

💡 Quality: createTableWithMultiTokenColumns duplicates boilerplate from other helpers

The new createTableWithMultiTokenColumns method (lines 764-809) repeats ~30 lines of service/database/schema creation that are identical to createTableWithColumns, createTableWithNestedColumns, and createTableWithDeeplyNestedColumns. Per the project guidelines (no duplication, extract shared logic), consider extracting the common infra setup into a shared helper that accepts a List<Column> and returns a Table.

Comment on lines +567 to +581
  void testUnconfiguredAssetTypeFallbackMatchesViaDataAsset(TestNamespace ns) throws Exception {
    OpenMetadataClient client = SdkClients.adminClient();
    String name = ns.prefix("glossary_unconfigured_fallback");
    CreateGlossary req = new CreateGlossary().withName(name).withDescription(name);
    Glossary glossary = client.glossaries().create(req);
    assertNotNull(glossary);

    Awaitility.await()
        .atMost(90, TimeUnit.SECONDS)
        .pollInterval(500, TimeUnit.MILLISECONDS)
        .until(() -> totalHitsForIndex(client, name, "dataAsset") >= 1);

    long total = totalHitsForIndex(client, name, "dataAsset");
    assertTrue(
        total >= 1,

💡 Quality: Glossary test doesn't clean up created glossary entity

testUnconfiguredAssetTypeFallbackMatchesViaDataAsset creates a glossary via client.glossaries().create(req) but never deletes it. The other helpers use TestNamespace-scoped names that presumably get cleaned up, but this glossary is created ad-hoc. If the test framework doesn't auto-clean glossaries, this is a resource leak across test runs. Verify the cleanup strategy or add an @AfterEach / try-finally deletion.

mohityadav766 and others added 4 commits April 30, 2026 16:39
Copilot AI review requested due to automatic review settings April 30, 2026 11:13
- testTopicBucketAndResultSetMatchIndexTopic: pins parity for a non-column
  entity type at both the count and the FQN-set level, exercising the
  per-type-union path for `topic`. The dataAsset alias must return the same
  set of topic FQNs that index=topic returns for the same query.
- testComplexSyntaxQueriesKeepParity: runs four representative complex-
  syntax shapes (quoted phrase, AND, OR, mixed parenthesised) through both
  endpoints and asserts the tableColumn bucket count equals
  index=tableColumn total.

Issue: open-metadata/openmetadata-collate#3851

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Comment on lines +530 to +531
for (String configured : configuredTypes) {
fallback.mustNot(ElasticQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured));
Copilot AI Apr 30, 2026

buildUnconfiguredAssetFallbackV2 currently emits one mustNot term(entityType=...) per configured type. With many asset types this can create a large bool query and add overhead. Use a single mustNot with a terms query over configuredTypes instead (ElasticQueryBuilder supports termsQuery).

Suggested change
-     for (String configured : configuredTypes) {
-       fallback.mustNot(ElasticQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured));
+     if (!configuredTypes.isEmpty()) {
+       fallback.mustNot(ElasticQueryBuilder.termsQuery(ENTITY_TYPE_FIELD, configuredTypes));

Comment on lines +387 to +405

    String multiTokenQuery = tag + "_first " + tag + "_address";

    Awaitility.await()
        .atMost(90, TimeUnit.SECONDS)
        .pollInterval(500, TimeUnit.MILLISECONDS)
        .until(
            () -> {
              String r =
                  client
                      .search()
                      .query(multiTokenQuery)
                      .index("tableColumn")
                      .size(0)
                      .deleted(false)
                      .execute();
              JsonNode root = OBJECT_MAPPER.readTree(r);
              long total = root.path("hits").path("total").path("value").asLong(-1);
              return total >= 3;
Copilot AI Apr 30, 2026

The awaited query tag + "_first " + tag + "_address" is unlikely to ever reach total >= 3 now that the column builder uses a multi_match with Operator.And (each hit must contain all analyzed tokens, so no single column will match both "first" and "*_address"). This can make the Awaitility wait time out and fail the test. Use a query that is expected to match at least one seeded column under the new AND semantics (e.g., target a single column name) and wait for that expected minimum instead of >= 3.

Comment on lines +427 to +432
    JsonNode aggBuckets =
        OBJECT_MAPPER
            .readTree(aggResponse)
            .path("aggregations")
            .path("sterms#entityType")
            .path("buckets");
Copilot AI Apr 30, 2026

This code assumes the entity-type aggregation is always under aggregations.sterms#entityType. In this repo, other integration tests account for responses that use either entityType or sterms#entityType depending on backend/aggregation builder. To avoid backend-specific failures, locate the aggregation node by checking both keys (or factor this into a shared helper).

Suggested change
-     JsonNode aggBuckets =
-         OBJECT_MAPPER
-             .readTree(aggResponse)
-             .path("aggregations")
-             .path("sterms#entityType")
-             .path("buckets");
+     JsonNode aggregations = OBJECT_MAPPER.readTree(aggResponse).path("aggregations");
+     JsonNode entityTypeAggregation = aggregations.path("entityType");
+     if (entityTypeAggregation.isMissingNode()) {
+       entityTypeAggregation = aggregations.path("sterms#entityType");
+     }
+     JsonNode aggBuckets = entityTypeAggregation.path("buckets");
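
The same two-key lookup could be factored into a shared helper. The sketch below uses plain `Map`/`List` values instead of Jackson's `JsonNode` so it stays self-contained; `EntityTypeBucketsSketch` and `entityTypeBuckets` are hypothetical names, and only the two aggregation keys come from the comment above.

```java
import java.util.List;
import java.util.Map;

/** Illustrative helper: finds the entity-type buckets under either aggregation key. */
public class EntityTypeBucketsSketch {

  /** Returns the bucket list, or an empty list when neither key carries one. */
  public static List<?> entityTypeBuckets(Map<String, ?> aggregations) {
    Object aggregation = aggregations.get("entityType");
    if (aggregation == null) {
      // Some responses prefix the aggregation name with its type.
      aggregation = aggregations.get("sterms#entityType");
    }
    if (!(aggregation instanceof Map)) {
      return List.of();
    }
    Object buckets = ((Map<?, ?>) aggregation).get("buckets");
    if (buckets instanceof List) {
      return (List<?>) buckets;
    }
    return List.of();
  }
}
```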

    Awaitility.await()
        .atMost(90, TimeUnit.SECONDS)
        .pollInterval(500, TimeUnit.MILLISECONDS)
        .until(() -> totalHitsForIndex(client, prefixQuery, "tableColumn") >= 5);
Copilot AI Apr 30, 2026

The Awaitility condition totalHitsForIndex(...) >= 5 looks too high for the data created in this test (the helper creates 3 columns). If the random tag is unique (likely), the prefix query may only ever match those seeded columns, causing an unnecessary timeout/flaky failure. Consider waiting for the expected minimum (e.g., >= 3) or simply > 0.

Suggested change
-         .until(() -> totalHitsForIndex(client, prefixQuery, "tableColumn") >= 5);
+         .until(() -> totalHitsForIndex(client, prefixQuery, "tableColumn") >= 3);

Comment on lines +609 to +614
    long aggTopicBucket = bucketCountFromDataAsset(client, tag, Entity.TOPIC);
    assertEquals(
        topicTotal,
        aggTopicBucket,
        "dataAsset topic bucket must equal index=topic total for query " + tag);

Copilot AI Apr 30, 2026

Helper bucketCountFromDataAsset hard-codes aggregations.sterms#entityType. Elsewhere (e.g., SearchResourceIT) the code handles both entityType and sterms#entityType. To prevent backend-dependent failures, update this helper to select whichever aggregation key is present.

Comment on lines +550 to +551
for (String configured : configuredTypes) {
fallback.mustNot(OpenSearchQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured));
Copilot AI Apr 30, 2026

buildUnconfiguredAssetFallbackV2 adds one mustNot term(entityType=...) clause per configured type. With many configured asset types this can bloat the query and slow execution. Prefer a single mustNot with a terms query over the configuredTypes set (the query builder already supports termsQuery).

Suggested change
- for (String configured : configuredTypes) {
-   fallback.mustNot(OpenSearchQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured));
+ if (!configuredTypes.isEmpty()) {
+   fallback.mustNot(
+       OpenSearchQueryBuilder.termsQuery(
+           ENTITY_TYPE_FIELD,
+           configuredTypes.stream().map(FieldValue::of).toList()));
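
As a rough illustration of the size difference, the sketch below models both `mustNot` shapes with plain maps instead of the real `OpenSearchQueryBuilder` API. The class and method names (`MustNotShapeSketch`, `perTypeTermClauses`, `singleTermsClause`) and the sample entity types are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Models the two mustNot shapes with plain maps; not the real query builder API. */
final class MustNotShapeSketch {

  /** Per-type shape: one term clause for every configured type. */
  static List<Map<String, Object>> perTypeTermClauses(Set<String> configuredTypes) {
    List<Map<String, Object>> mustNot = new ArrayList<>();
    for (String type : new TreeSet<>(configuredTypes)) { // sorted for stable output
      Map<String, Object> clause = Map.of("term", Map.of("entityType", type));
      mustNot.add(clause);
    }
    return mustNot;
  }

  /** Suggested shape: a single terms clause, skipped when nothing is configured. */
  static List<Map<String, Object>> singleTermsClause(Set<String> configuredTypes) {
    if (configuredTypes.isEmpty()) {
      return List.of();
    }
    List<String> sorted = new ArrayList<>(new TreeSet<>(configuredTypes));
    Map<String, Object> clause = Map.of("terms", Map.of("entityType", sorted));
    return List.of(clause);
  }

  public static void main(String[] args) {
    // Hypothetical configured types, chosen only for the demonstration.
    Set<String> types = Set.of("table", "topic", "dashboard");
    System.out.println(perTypeTermClauses(types).size() + " term clauses");
    System.out.println(singleTermsClause(types).size() + " terms clause");
  }
}
```

In this model the per-type shape grows by one clause per configured type, while the `terms` shape stays at one clause regardless of how many types are configured.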

mohityadav766 and others added 3 commits April 30, 2026 16:52
…n totals

Refactor the `dataAsset`/`all` alias query path so each entity-type bucket in
the aggregation matches what its dedicated index returns for the same query.
The composite asset config used to merge fields from every type, then apply
phrase/ngram-fuzzy semantics to all docs; column docs got semantics different
from `buildColumnSearchBuilderV2`, which is why the explore search bar's two
calls disagreed on the tableColumn count.

The new `buildAllAssetsSearchBuilderV2` builds a per-entity-type bool union:
each clause is `filter(entityType=<type>) must(<type's own query>)`. The
column branch reuses `buildColumnMultiMatchV2`; every other type goes through
`buildBaseQueryV2` with its dedicated config; an extra `should` covers asset
types in the `dataAsset` alias that lack a config (e.g. `glossary`,
`apiCollection`) using the default config.
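
The shape of that union can be pictured with plain maps. This is only a sketch of the query structure, assuming a `should` list of `bool` clauses; it does not use the real builder classes, and `PerTypeUnionSketch`, `clause`, `union`, and the sample per-type queries are hypothetical names and values introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative query-shape sketch using plain maps; not the real builder API. */
final class PerTypeUnionSketch {

  /** One union clause: filter on the entity type, must on that type's own query. */
  static Map<String, Object> clause(String entityType, Map<String, Object> typeQuery) {
    Map<String, Object> bool = new LinkedHashMap<>();
    bool.put("filter", List.of(Map.of("term", Map.of("entityType", entityType))));
    bool.put("must", List.of(typeQuery));
    return Map.of("bool", bool);
  }

  /** Union of the per-type clauses, collected under a single should list. */
  static Map<String, Object> union(Map<String, Map<String, Object>> queriesByType) {
    List<Object> should = new ArrayList<>();
    queriesByType.forEach((type, query) -> should.add(clause(type, query)));
    Map<String, Object> bool = new LinkedHashMap<>();
    bool.put("should", should);
    return Map.of("bool", bool);
  }

  public static void main(String[] args) {
    Map<String, Map<String, Object>> queriesByType = new LinkedHashMap<>();
    // Hypothetical per-type queries; the real ones come from each type's own config.
    queriesByType.put(
        "tableColumn",
        Map.of("multi_match", Map.of("query", "first_name", "operator", "and")));
    queriesByType.put("table", Map.of("multi_match", Map.of("query", "first_name")));
    System.out.println(union(queriesByType));
  }
}
```

Because each clause filters on its own `entityType`, a document can only satisfy the clause built for its type, which is why each bucket count is expected to follow that type's own query.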

Also tightens the column builder to `Operator.And` so multi-token queries
like `first_name` require every sub-token to match somewhere — fixes the
`om_analyzer`-driven over-match where the lenient `Or` + `min_should_match=0`
variant matched any column whose name contained just `first` or just `name`.

`ColumnSearchIndex.getFields()` gains `name.ngram`, `name.compound`,
`displayName.ngram`, `displayName.compound` so prefix queries like `fir`
still match column docs from both `index=tableColumn` and the dataAsset
bucket.

Adds integration tests in ColumnSearchIndexIT covering:
- multi-token query parity for the tableColumn bucket vs index=tableColumn
- sub-token over-match guard (query `<tag>_first_name` must not match a
  seeded `<tag>_first_id` column)
- table bucket parity (count) and topic parity (count + FQN-set match)
- prefix queries via `name.ngram` keeping parity across both endpoints
- complex-syntax queries (quoted phrase, AND/OR, mixed parenthesised)
  keeping the tableColumn bucket parity
- unconfigured asset type fallback (Glossary docs reachable via dataAsset)

Issue: open-metadata/openmetadata-collate#3851

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 11:24

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment on lines +735 to +740
JsonNode buckets =
    OBJECT_MAPPER
        .readTree(response)
        .path("aggregations")
        .path("sterms#entityType")
        .path("buckets");

Copilot AI Apr 30, 2026

bucketCountFromDataAsset assumes the entity-type aggregation lives under aggregations.sterms#entityType, but other integration tests in this repo already need to support aggregations.entityType as well. This helper should check for both keys; otherwise it will silently return 0 and break the parity assertions depending on search backend/serialization.

Suggested change
- JsonNode buckets =
-     OBJECT_MAPPER
-         .readTree(response)
-         .path("aggregations")
-         .path("sterms#entityType")
-         .path("buckets");
+ JsonNode root = OBJECT_MAPPER.readTree(response);
+ JsonNode aggregations = root.path("aggregations");
+ JsonNode buckets = aggregations.path("sterms#entityType").path("buckets");
+ if (buckets.isMissingNode() || !buckets.isArray()) {
+   buckets = aggregations.path("entityType").path("buckets");
+ }

Comment on lines +388 to +406
String multiTokenQuery = tag + "_first " + tag + "_address";

Awaitility.await()
    .atMost(90, TimeUnit.SECONDS)
    .pollInterval(500, TimeUnit.MILLISECONDS)
    .until(
        () -> {
          String r =
              client
                  .search()
                  .query(multiTokenQuery)
                  .index("tableColumn")
                  .size(0)
                  .deleted(false)
                  .execute();
          JsonNode root = OBJECT_MAPPER.readTree(r);
          long total = root.path("hits").path("total").path("value").asLong(-1);
          return total >= 3;
        });

Copilot AI Apr 30, 2026

multiTokenQuery is constructed as two different column-name prefixes separated by a space (<tag>_first <tag>_address). With the updated column multi-match using Operator.And, this query is unlikely to match any single column document (no column contains both “first” and “address”), which would cause the Awaitility block to time out and make the test fail. Use a query that is expected to match at least one seeded column doc (e.g., the actual <tag>_first_name term or another query where all required tokens are present in the same column).
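
To make the concern concrete, here is a minimal sketch of And versus Or sub-token matching. It assumes an analyzer that lowercases and splits on underscores and whitespace, which is only an approximation of `om_analyzer`; `SubTokenMatchSketch`, `tokens`, `matchesAny`, and `matchesAll` are hypothetical names, not code from this PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative stand-in for analyzer sub-token matching; not the OpenMetadata implementation. */
final class SubTokenMatchSketch {

  /** Assumption: the analyzer lowercases and splits on underscores and whitespace. */
  static Set<String> tokens(String text) {
    return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[_\\s]+"))
        .filter(t -> !t.isEmpty())
        .collect(Collectors.toSet());
  }

  /** Or semantics: any single query token found in the column name is enough. */
  static boolean matchesAny(String query, String columnName) {
    Set<String> column = tokens(columnName);
    return tokens(query).stream().anyMatch(column::contains);
  }

  /** And semantics: every query token must be present in the column name. */
  static boolean matchesAll(String query, String columnName) {
    return tokens(columnName).containsAll(tokens(query));
  }

  public static void main(String[] args) {
    // Column names modeled on the seeded fixture, with "tag" standing in for the random tag.
    List<String> columns = List.of("tag_first_name", "tag_last_name", "tag_address");
    String query = "tag_first tag_address";
    for (String column : columns) {
      System.out.printf(
          "%s or=%b and=%b%n", column, matchesAny(query, column), matchesAll(query, column));
    }
  }
}
```

Under these assumptions, `matchesAll` is false for each of the three column names because none contains both `first` and `address`, while `matchesAny` is true for all of them since they share the `tag` token.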

Comment on lines +455 to +458
* into {@code [first, name]}; with the old {@code Operator.Or} + {@code min_should_match=0}
* column builder, a column called {@code <tag>_first_id} matched a query of {@code
* <tag>_first_name} because the single token {@code first} was enough. The fix moves the
* column builder to {@code Operator.And}, so every sub-token must match.

Copilot AI Apr 30, 2026

The comment says the old column builder used Operator.Or + min_should_match=0, but the implementation was passing the string "0" as the fuzziness parameter (see ElasticQueryBuilder/OpenSearchQueryBuilder.multiMatchQuery(..., fuzziness)). minimum_should_match was not set at all when fuzziness was "0". Please update the comment to reflect the actual behavior so future debugging isn’t misled.

Suggested change
-  * into {@code [first, name]}; with the old {@code Operator.Or} + {@code min_should_match=0}
-  * column builder, a column called {@code <tag>_first_id} matched a query of {@code
-  * <tag>_first_name} because the single token {@code first} was enough. The fix moves the
-  * column builder to {@code Operator.And}, so every sub-token must match.
+  * into {@code [first, name]}; with the old {@code Operator.Or} column builder, the string
+  * {@code "0"} was passed as the fuzziness value rather than setting {@code
+  * minimum_should_match}, so a column called {@code <tag>_first_id} matched a query of
+  * {@code <tag>_first_name} because the single token {@code first} was enough. The fix moves
+  * the column builder to {@code Operator.And}, so every sub-token must match.

Awaitility.await()
    .atMost(90, TimeUnit.SECONDS)
    .pollInterval(500, TimeUnit.MILLISECONDS)
    .until(() -> totalHitsForIndex(client, tag, "tableColumn") >= 5);

Copilot AI Apr 30, 2026

Similar to the prefix test, this Awaitility condition requires totalHitsForIndex(tag, "tableColumn") >= 5 even though the table fixture seeds 3 columns. In a properly isolated namespace this is likely to never reach 5 and will make the test flaky/fail. Lower the threshold to what the fixture guarantees (e.g., >= 1 or >= 3) or wait for a specific known seeded column to appear.

Suggested change
- .until(() -> totalHitsForIndex(client, tag, "tableColumn") >= 5);
+ .until(() -> totalHitsForIndex(client, tag, "tableColumn") >= 3);

Bug fixes the reviewer flagged:

- buildPerTypeUnionQueryV2 (both factories) now guards against
  searchSettings.getAssetTypeConfigurations() returning null/empty by
  falling back to the default config so dataAsset/all queries can't NPE if
  the search settings haven't been initialized.
- testDataAssetTableColumnAggregationMatchesTableColumnTotal previously
  awaited a multi-token query that no single seeded column could satisfy
  under the new Operator.And semantics, leaving the parity assertion
  trivially 0 == 0; switched to a query that actually matches multiple
  seeded columns.
- testColumnQueryRequiresAllSubtokensToMatch was relying on a column name
  (`<tag>_first_id`) that was never created, so the negative assertion was
  trivially true. createTableWithMultiTokenColumns now seeds first_id (and
  alpha/bravo columns used by the complex-syntax and prefix tests) so all
  the parity assertions exercise real indexed documents.
- bucketCountFromDataAsset accepts both `entityType` and
  `sterms#entityType` aggregation key shapes so the helper doesn't break
  on backends that label the bucket differently.
- Javadoc on buildColumnMultiMatchV2 (both factories) corrected to
  describe the actual previous shape (`Operator.Or` + `fuzziness="0"`,
  which left minimum_should_match unset) instead of the inaccurate
  `min_should_match=0`.

Issue: open-metadata/openmetadata-collate#3851

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
return SdkClients.adminClient().tables().create(tableRequest);
}

private static final int MULTI_TOKEN_SEED_COUNT = 7;

💡 Quality: MULTI_TOKEN_SEED_COUNT constant is defined but never used

The constant MULTI_TOKEN_SEED_COUNT = 7 was added at line 916 but is never referenced anywhere in the file. The magic numbers 5 in the await conditions (e.g., >= 5 at lines 505 and 605) and 2 at line 396 appear to be the intended use sites but still use raw literals. Either use the constant or remove it.

Suggested fix:

Either replace the magic numbers with the constant:
  .until(() -> totalHitsForIndex(client, tag, "tableColumn") >= MULTI_TOKEN_SEED_COUNT - 2);
or remove the unused constant if it was only added for documentation purposes.

Adds a `termsQuery(String, Collection<String>)` helper to both
OpenSearchQueryBuilder and ElasticQueryBuilder, then uses it in
buildUnconfiguredAssetFallbackV2 to replace the per-type `mustNot
term(entityType=...)` chain with a single `mustNot terms(entityType=[...])`.
Keeps the bool query small regardless of how many configured asset types
exist — the previous shape grew one clause per configured type.

Issue: open-metadata/openmetadata-collate#3851

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 11:59

gitar-bot Bot commented Apr 30, 2026

Code Review 👍 Approved with suggestions 4 resolved / 7 findings

Aligns dataAsset aggregation counts with index totals while resolving several test and null-pointer edge cases. Please refactor the redundant boilerplate in createTableWithMultiTokenColumns, remove the unused MULTI_TOKEN_SEED_COUNT constant, and ensure the glossary entity is properly cleaned up in tests.

💡 Quality: createTableWithMultiTokenColumns duplicates boilerplate from other helpers

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:764-778 📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:759-762

The new createTableWithMultiTokenColumns method (lines 764-809) repeats ~30 lines of service/database/schema creation that are identical to createTableWithColumns, createTableWithNestedColumns, and createTableWithDeeplyNestedColumns. Per the project guidelines (no duplication, extract shared logic), consider extracting the common infra setup into a shared helper that accepts a List<Column> and returns a Table.

💡 Quality: Glossary test doesn't clean up created glossary entity

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:567-581

testUnconfiguredAssetTypeFallbackMatchesViaDataAsset creates a glossary via client.glossaries().create(req) but never deletes it. The other helpers use TestNamespace-scoped names that presumably get cleaned up, but this glossary is created ad-hoc. If the test framework doesn't auto-clean glossaries, this is a resource leak across test runs. Verify the cleanup strategy or add an @AfterEach / try-finally deletion.

💡 Quality: MULTI_TOKEN_SEED_COUNT constant is defined but never used

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:916

The constant MULTI_TOKEN_SEED_COUNT = 7 was added at line 916 but is never referenced anywhere in the file. The magic numbers 5 in the await conditions (e.g., >= 5 at lines 505 and 605) and 2 at line 396 appear to be the intended use sites but still use raw literals. Either use the constant or remove it.

Suggested fix
Either replace the magic numbers with the constant:
  .until(() -> totalHitsForIndex(client, tag, "tableColumn") >= MULTI_TOKEN_SEED_COUNT - 2);
or remove the unused constant if it was only added for documentation purposes.
✅ 4 resolved
Bug: Sub-token over-match test is trivially true (missing first_id column)

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:461-475 📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:790-804
The test testColumnQueryRequiresAllSubtokensToMatch asserts that querying <tag>_first_name must NOT return a hit for <tag>_first_id. However, createTableWithMultiTokenColumns never creates a column named <tag>_first_id — it only seeds first_name, last_name, and address. The assertFalse(hits.contains(firstIdColumn)) therefore passes trivially regardless of whether the Operator.And fix is in place, making this test unable to catch a regression.

Add a <tag>_first_id column to createTableWithMultiTokenColumns so the negative assertion is actually exercised against a real indexed document.

Bug: NPE if getAssetTypeConfigurations() returns null

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSourceBuilderFactory.java:493 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSourceBuilderFactory.java:510
Both buildPerTypeUnionQueryV2 and buildCompositeAssetConfig (called from buildAllAssetsSearchBuilderV2) iterate searchSettings.getAssetTypeConfigurations() via an enhanced for-loop without a null guard. Other call sites in the codebase (e.g. SearchSettingsHandler:26) explicitly check for null before iterating. If the search settings are misconfigured or not yet initialized, this will throw a NullPointerException on every dataAsset/all query — a high-traffic path.
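
A null guard of the kind this finding calls for might look like the sketch below. It is only an illustration under stated assumptions: plain strings stand in for the asset type configuration objects, and `AssetConfigGuardSketch` and `configsOrDefault` are hypothetical names rather than the code that was merged.

```java
import java.util.List;

/** Illustrative null/empty guard; strings stand in for asset type configuration objects. */
final class AssetConfigGuardSketch {

  /** Returns the configured list, or a single default entry when it is null or empty. */
  static List<String> configsOrDefault(List<String> configured, String defaultConfig) {
    if (configured == null || configured.isEmpty()) {
      return List.of(defaultConfig);
    }
    return configured;
  }

  public static void main(String[] args) {
    // Safe to iterate even when the settings have not been initialized (null).
    for (String config : configsOrDefault(null, "default")) {
      System.out.println("building clause for config: " + config);
    }
  }
}
```

Resolving the fallback before the loop keeps the enhanced for-loop itself unchanged, so the iteration never sees a null collection.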

Bug: Complex-syntax test waits for ≥5 columns but only 3 are created

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:639-642 📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:916-930
In testComplexSyntaxQueriesKeepParity, the Awaitility poll waits for totalHitsForIndex(client, tag, "tableColumn") >= 5, but the setup calls createTableWithMultiTokenColumns which creates exactly 3 columns (tag_first_name, tag_last_name, tag_address). Since each test gets its own TestNamespace, no other test contributes columns with the same tag. The await condition can never be satisfied, so the test will always time out after 90 seconds and fail.

Edge Case: Complex-syntax quoted-phrase query may match zero hits trivially

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:644-649 📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:916-930
The first complex query is "tag alpha" (quoted phrase). None of the columns created by createTableWithMultiTokenColumns contain the word alpha — column names are tag_first_name, tag_last_name, tag_address. This means totalHitsForIndex and bucketCountFromDataAsset will both return 0, making the parity assertion trivially true. Similarly, tag AND alpha and (tag AND alpha) OR (tag AND bravo) will match 0 columns. Only tag OR alpha might produce a meaningful count. Consider adding a column whose name includes alpha (e.g. tag_alpha_metric) so these complex-syntax shapes exercise real matching.

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

🔴 Playwright Results — 156 failure(s), 16 flaky

✅ 1494 passed · ❌ 156 failed · 🟡 16 flaky · ⏭️ 152 skipped

| Shard | Passed | Failed | Flaky | Skipped |
|---|---|---|---|---|
| 🔴 Shard 1 | 210 | 26 | 1 | 66 |
| 🔴 Shard 2 | 704 | 39 | 8 | 11 |
| 🔴 Shard 3 | 580 | 91 | 7 | 75 |

Genuine Failures (failed on all attempts)

Features/DataAssetRulesEnabled.spec.ts › Verify the ApiEndpoint Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the Store Procedure Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the Dashboard Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the Pipeline Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the MlModel Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the DashboardDataModel Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the Chart Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the Directory Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the File Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the Spreadsheet Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/DataAssetRulesEnabled.spec.ts › Verify the Worksheet Entity Action items after rules is Enabled (shard 1)
Test timeout of 180000ms exceeded.
Features/CustomizeDetailPage.spec.ts › Dashboard - customization should work (shard 1)
Test timeout of 180000ms exceeded.
Features/CustomizeDetailPage.spec.ts › Ml Model - customization should work (shard 1)
Test timeout of 180000ms exceeded.
Features/CustomizeDetailPage.spec.ts › Pipeline - customization should work (shard 1)
Test timeout of 180000ms exceeded.
Features/CustomizeDetailPage.spec.ts › Dashboard Data Model - customization should work (shard 1)
Test timeout of 180000ms exceeded.
Features/CustomizeDetailPage.spec.ts › Stored Procedure - customization should work (shard 1)
Test timeout of 180000ms exceeded.
Features/CustomizeDetailPage.spec.ts › API Endpoint - customization should work (shard 1)
Test timeout of 180000ms exceeded.
Features/Dashboards.spec.ts › should be able to toggle between deleted and non-deleted charts (shard 1)
Test timeout of 180000ms exceeded.
Features/DescriptionSuggestion.spec.ts › should add and accept a requested topic schema field description (shard 1)
Test timeout of 180000ms exceeded.
Features/TagsSuggestion.spec.ts › should add and accept requested tags for a topic schema field (shard 1)
Test timeout of 180000ms exceeded.
Flow/Metric.spec.ts › Metric creation flow should work (shard 1)
Test timeout of 180000ms exceeded while running "beforeEach" hook.
Flow/Metric.spec.ts › Verify Metric Type Update (shard 1)
Test timeout of 180000ms exceeded while running "beforeEach" hook.
Flow/Metric.spec.ts › Verify Unit of Measurement Update (shard 1)
Test timeout of 180000ms exceeded while running "beforeEach" hook.
Flow/Metric.spec.ts › Verify Granularity Update (shard 1)
Test timeout of 180000ms exceeded while running "beforeEach" hook.
Flow/Metric.spec.ts › verify metric expression update (shard 1)
Test timeout of 180000ms exceeded while running "beforeEach" hook.
Flow/Metric.spec.ts › Verify Related Metrics Update (shard 1)
Test timeout of 180000ms exceeded while running "beforeEach" hook.
Features/ChangeSummaryBadge.spec.ts › AI badge should appear on entity description with Suggested source (shard 2)
Error: expect(received).toBe(expected) // Object.is equality

Expected: 200
Received: 500
Features/ChangeSummaryBadge.spec.ts › AI badge should appear on column description with Suggested source (shard 2)
Test timeout of 60000ms exceeded.
Features/Container.spec.ts › expand / collapse should not appear after updating nested fields for container (shard 2)
Test timeout of 180000ms exceeded.
Features/Container.spec.ts › Copy column link button should copy the column URL to clipboard (shard 2)
Test timeout of 180000ms exceeded.

... and 126 more failures

🟡 16 flaky test(s) (passed on retry)
  • Features/DataAssetRulesEnabled.spec.ts › Verify the Metric Entity Action items after rules is Enabled (shard 1, 2 retries)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when owner is added (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 2, 1 retry)
  • Features/ExploreQuickFilters.spec.ts › should search for multiple values along with null filters (shard 2, 1 retry)
  • Features/ExploreQuickFilters.spec.ts › tier with assigned asset appears in dropdown, tier without asset does not (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/IncidentManager.spec.ts › Complete Incident lifecycle with table owner (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/NestedChildrenUpdates.spec.ts › should update nested column description immediately without page refresh (shard 3, 2 retries)
  • Flow/PlatformLineage.spec.ts › Verify Platform Lineage View (shard 3, 2 retries)
  • Flow/ServiceDocPanel.spec.ts › should render headings not raw markdown (shard 3, 2 retries)
  • Flow/ServiceDocPanel.spec.ts › should render image in Mssql doc panel (shard 3, 1 retry)
  • Flow/ServiceDocPanel.spec.ts › should only ever have one section highlighted at a time (shard 3, 2 retries)
  • Flow/ServiceDocPanel.spec.ts › should load the correct doc file for the selected service type (shard 3, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Labels

backend, safe to test

2 participants