fix(search): align dataAsset aggregation counts with index=tableColumn totals#27846
fix(search): align dataAsset aggregation counts with index=tableColumn totals#27846mohityadav766 wants to merge 12 commits intomainfrom
Conversation
…n totals Refactor the `dataAsset`/`all` alias query path so each entity-type bucket in the aggregation matches what its dedicated index returns for the same query. The composite asset config used to merge fields from every type, then apply phrase/ngram-fuzzy semantics to all docs; column docs got semantics different from `buildColumnSearchBuilderV2`, which is why the explore search bar's two calls disagreed on the tableColumn count. The new `buildAllAssetsSearchBuilderV2` builds a per-entity-type bool union: each clause is `filter(entityType=<type>) must(<type's own query>)`. The column branch reuses `buildColumnMultiMatchV2`; every other type goes through `buildBaseQueryV2` with its dedicated config; an extra `should` covers asset types in the `dataAsset` alias that lack a config (e.g. `glossary`, `apiCollection`) using the default config. Also tightens the column builder to `Operator.And` so multi-token queries like `first_name` require every sub-token to match somewhere — fixes the `om_analyzer`-driven over-match where the lenient `Or` + `min_should_match=0` variant matched any column whose name contained just `first` or just `name`. `ColumnSearchIndex.getFields()` gains `name.ngram`, `name.compound`, `displayName.ngram`, `displayName.compound` so prefix queries like `fir` still match column docs from both `index=tableColumn` and the dataAsset bucket. Includes `scripts/reproduce_column_agg_mismatch.sh` — multi-query probe that exits non-zero on any divergence between `index=dataAsset` (aggregation bucket) and `index=tableColumn` (total) for the same query. Issue: open-metadata/openmetadata-collate#3851 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds integration tests under ColumnSearchIndexIT that pin the behavior of the fix in PR #27846: - testDataAssetTableColumnAggregationMatchesTableColumnTotal: dataAsset bucket count for tableColumn equals index=tableColumn total for a multi-token query against seeded columns. - testColumnQueryRequiresAllSubtokensToMatch: query "<tag>_first_name" must match the seeded "<tag>_first_name" column but NOT "<tag>_first_id" — pins the Operator.And fix that closed the om_analyzer sub-token over-match. - testDataAssetTableBucketMatchesTableIndexTotal: same parity guarantee for the "table" entity-type bucket, exercising the per-type-union path for a non-column type. - testPrefixQueryMatchesViaNgramOnBothPaths: short prefix queries (e.g. the first few chars of the seeded tag) must match seeded columns via name.ngram and stay in parity across both endpoints. - testUnconfiguredAssetTypeFallbackMatchesViaDataAsset: a Glossary doc (an asset type without an explicit searchSettings.assetTypeConfigurations entry) must still surface via index=dataAsset, exercising the fallback should clause in buildPerTypeUnionQueryV2. Issue: open-metadata/openmetadata-collate#3851 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the search query-building logic for the all / dataAsset aliases so that entity-type aggregation bucket counts (notably tableColumn) align with the totals returned by the corresponding dedicated index queries, and tightens column query matching to avoid underscore sub-token overmatching.
Changes:
- Route
index=allandindex=dataAssetthrough a new per-entity-type union query builder (ElasticSearch + OpenSearch) to align aggregation bucket counts with dedicated index totals. - Tighten
tableColumnquery behavior by reusing a shared column multi-match builder withOperator.And. - Expand column searchable fields to include
name.ngram,name.compound,displayName.ngram, anddisplayName.compoundto preserve prefix/partial matching behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| scripts/reproduce_column_agg_mismatch.sh | Adds a probe script to reproduce/verify the dataAsset aggregation vs tableColumn totals mismatch. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSourceBuilderFactory.java | Implements the per-entity-type union query for all/dataAsset and centralizes the stricter column multi-match. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSourceBuilderFactory.java | Mirrors the per-entity-type union query approach for ElasticSearch and centralizes the stricter column multi-match. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/ColumnSearchIndex.java | Adds ngram/compound subfields to the column field boost map to support prefix/partial matching. |
| * field. Without {@code And}, a query like {@code first_name} matches any column whose name | ||
| * contains just {@code first} or just {@code name}, which both inflates the column index hits | ||
| * and creates the dataAsset/tableColumn count mismatch tracked in github issue #3851. |
There was a problem hiding this comment.
The Javadoc references the old behavior as Operator.Or + min_should_match=0, but the helper used here takes fuzziness as the last parameter and only sets minimum_should_match when fuzziness is enabled and operator is OR. Please adjust the comment to match the real previous query shape to avoid confusion.
| * field. Without {@code And}, a query like {@code first_name} matches any column whose name | |
| * contains just {@code first} or just {@code name}, which both inflates the column index hits | |
| * and creates the dataAsset/tableColumn count mismatch tracked in github issue #3851. | |
| * field. Previously this query used the same helper with {@code Operator.Or} and a last | |
| * argument of {@code "0"}; with that OR-style query, a search like {@code first_name} matched | |
| * any column whose name contained just {@code first} or just {@code name}, which inflated the | |
| * column index hits and created the dataAsset/tableColumn count mismatch tracked in github | |
| * issue #3851. |
| set -euo pipefail | ||
|
|
||
| HOST="${OM_HOST:-http://localhost:8585}" | ||
| TOKEN="${OM_TOKEN:-eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg}" |
There was a problem hiding this comment.
The script hard-codes a full JWT as the default OM_TOKEN value. Even if it’s intended for local dev, committing tokens is a security risk and encourages running with a privileged credential. Please remove the embedded token and instead require OM_TOKEN to be provided (or fail fast with a clear message / optionally prompt).
| TOKEN="${OM_TOKEN:-eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg}" | |
| TOKEN="${OM_TOKEN:?ERROR: OM_TOKEN must be set to a valid OpenMetadata JWT before running this script.}" |
| /** | ||
| * Build a search source for the {@code all} / {@code dataAsset} alias as a per-entity-type | ||
| * union: each asset type contributes a clause built with its own configuration (column docs go | ||
| * through {@link #buildColumnMultiMatchV2(String)}, every other type through {@link | ||
| * #buildBaseQueryV2(String, AssetTypeConfiguration)}), filtered by {@code entityType=<type>}. | ||
| * Each entity-type bucket in the aggregation therefore equals what the dedicated index returns | ||
| * for the same query, by construction. Avoids the composite-config divergence behind | ||
| * github.com/open-metadata/openmetadata-collate#3851. | ||
| */ | ||
| public ElasticSearchRequestBuilder buildAllAssetsSearchBuilderV2( | ||
| String query, int from, int size, boolean explain, boolean includeAggregations) { | ||
| AssetTypeConfiguration compositeConfig = buildCompositeAssetConfig(searchSettings); | ||
| es.co.elastic.clients.elasticsearch._types.query_dsl.Query baseQuery = | ||
| buildPerTypeUnionQueryV2(query); | ||
| es.co.elastic.clients.elasticsearch._types.query_dsl.Query finalQuery = | ||
| applyFunctionScoringV2(baseQuery, compositeConfig); | ||
| es.co.elastic.clients.elasticsearch.core.search.Highlight highlightBuilder = | ||
| buildHighlightingIfNeededV2(query, compositeConfig); | ||
|
|
||
| ElasticSearchRequestBuilder searchRequestBuilder = | ||
| createSearchSourceBuilderV2(finalQuery, from, size); | ||
| if (highlightBuilder != null) { | ||
| searchRequestBuilder.highlighter(highlightBuilder); | ||
| } | ||
| if (includeAggregations) { | ||
| addConfiguredAggregationsV2(searchRequestBuilder, compositeConfig); | ||
| } | ||
| searchRequestBuilder.explain(explain); | ||
| return searchRequestBuilder; | ||
| } | ||
|
|
||
| private es.co.elastic.clients.elasticsearch._types.query_dsl.Query buildPerTypeUnionQueryV2( | ||
| String query) { | ||
| if (isMatchAllQuery(query)) { | ||
| return ElasticQueryBuilder.boolQuery().must(ElasticQueryBuilder.matchAllQuery()).build(); | ||
| } | ||
| ElasticQueryBuilder.BoolQueryBuilder union = ElasticQueryBuilder.boolQuery(); | ||
| Set<String> configuredTypes = new HashSet<>(); | ||
| for (AssetTypeConfiguration typeConfig : searchSettings.getAssetTypeConfigurations()) { | ||
| String assetType = typeConfig.getAssetType(); | ||
| if (assetType == null || assetType.equals(INDEX_ALL)) { | ||
| continue; | ||
| } | ||
| configuredTypes.add(assetType); | ||
| union.should(buildAssetTypeClauseV2(query, assetType, typeConfig)); | ||
| } | ||
| union.should(buildUnconfiguredAssetFallbackV2(query, configuredTypes)); | ||
| union.minimumShouldMatch(1); | ||
| return union.build(); |
There was a problem hiding this comment.
This refactor changes the query semantics for the high-traffic all/dataAsset alias (per-entity-type bool union) and tightens column matching, but there are no automated tests added here to lock in the new bucket-parity and underscore sub-token behavior. Please add/extend tests (unit and/or integration) that assert: (1) index=dataAsset entityType bucket counts match the totals from the corresponding dedicated index for at least table and tableColumn, and (2) a query like first_name does not match columns that only contain first or only name.
| /** | ||
| * Build a search source for the {@code all} / {@code dataAsset} alias as a per-entity-type | ||
| * union: each asset type contributes a clause built with its own configuration (column docs go | ||
| * through {@link #buildColumnMultiMatchV2(String)}, every other type through {@link | ||
| * #buildBaseQueryV2(String, AssetTypeConfiguration)}), filtered by {@code entityType=<type>}. | ||
| * Each entity-type bucket in the aggregation therefore equals what the dedicated index returns | ||
| * for the same query, by construction. Avoids the composite-config divergence behind | ||
| * github.com/open-metadata/openmetadata-collate#3851. | ||
| */ | ||
| public OpenSearchRequestBuilder buildAllAssetsSearchBuilderV2( | ||
| String query, int from, int size, boolean explain, boolean includeAggregations) { | ||
| AssetTypeConfiguration compositeConfig = getOrBuildCompositeConfig(); | ||
| os.org.opensearch.client.opensearch._types.query_dsl.Query baseQuery = | ||
| buildPerTypeUnionQueryV2(query); | ||
| os.org.opensearch.client.opensearch._types.query_dsl.Query finalQuery = | ||
| applyFunctionScoringV2(baseQuery, compositeConfig); | ||
| os.org.opensearch.client.opensearch.core.search.Highlight highlightBuilder = | ||
| buildHighlightingIfNeededV2(query, compositeConfig); | ||
|
|
||
| OpenSearchRequestBuilder searchRequestBuilder = | ||
| createSearchSourceBuilderV2(finalQuery, from, size); | ||
| if (highlightBuilder != null) { | ||
| searchRequestBuilder.highlighter(highlightBuilder); | ||
| } | ||
| if (includeAggregations) { | ||
| addConfiguredAggregationsV2(searchRequestBuilder, compositeConfig); | ||
| } | ||
| searchRequestBuilder.explain(explain); | ||
| return searchRequestBuilder; | ||
| } | ||
|
|
||
| private os.org.opensearch.client.opensearch._types.query_dsl.Query buildPerTypeUnionQueryV2( | ||
| String query) { | ||
| if (isMatchAllQuery(query)) { | ||
| return OpenSearchQueryBuilder.boolQuery() | ||
| .must(OpenSearchQueryBuilder.matchAllQuery()) | ||
| .build(); | ||
| } | ||
| OpenSearchQueryBuilder.BoolQueryBuilder union = OpenSearchQueryBuilder.boolQuery(); | ||
| Set<String> configuredTypes = new HashSet<>(); | ||
| for (AssetTypeConfiguration typeConfig : searchSettings.getAssetTypeConfigurations()) { | ||
| String assetType = typeConfig.getAssetType(); | ||
| if (assetType == null || assetType.equals(INDEX_ALL)) { | ||
| continue; | ||
| } | ||
| configuredTypes.add(assetType); | ||
| union.should(buildAssetTypeClauseV2(query, assetType, typeConfig)); | ||
| } | ||
| union.should(buildUnconfiguredAssetFallbackV2(query, configuredTypes)); | ||
| union.minimumShouldMatch(1); | ||
| return union.build(); | ||
| } | ||
|
|
||
| private static boolean isMatchAllQuery(String query) { | ||
| return query == null || query.trim().isEmpty() || query.trim().equals("*"); | ||
| } | ||
|
|
||
| private os.org.opensearch.client.opensearch._types.query_dsl.Query buildAssetTypeClauseV2( | ||
| String query, String assetType, AssetTypeConfiguration typeConfig) { | ||
| os.org.opensearch.client.opensearch._types.query_dsl.Query inner = | ||
| Entity.TABLE_COLUMN.equals(assetType) | ||
| ? buildColumnMultiMatchV2(query) | ||
| : buildBaseQueryV2(query, typeConfig); | ||
| return OpenSearchQueryBuilder.boolQuery() | ||
| .filter(OpenSearchQueryBuilder.termQuery(ENTITY_TYPE_FIELD, assetType)) | ||
| .must(inner) | ||
| .build(); | ||
| } | ||
|
|
||
| /** | ||
| * Catches asset types that are part of the {@code dataAsset} alias but lack a dedicated entry in | ||
| * {@code searchSettings.assetTypeConfigurations} (e.g. {@code glossary}, {@code apiCollection}). | ||
| * Without this, docs of those types would silently disappear from the dataAsset alias after the | ||
| * per-type-union refactor. | ||
| */ | ||
| private os.org.opensearch.client.opensearch._types.query_dsl.Query | ||
| buildUnconfiguredAssetFallbackV2(String query, Set<String> configuredTypes) { | ||
| OpenSearchQueryBuilder.BoolQueryBuilder fallback = | ||
| OpenSearchQueryBuilder.boolQuery() | ||
| .must(buildBaseQueryV2(query, getOrCreateDefaultConfig())); | ||
| for (String configured : configuredTypes) { | ||
| fallback.mustNot(OpenSearchQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured)); | ||
| } | ||
| return fallback.build(); | ||
| } |
There was a problem hiding this comment.
The new all/dataAsset per-type-union behavior and the stricter column multi-match are not covered by tests in this PR. Please add/extend tests to ensure (a) dataAsset entityType bucket counts stay in sync with dedicated-index totals (especially tableColumn), and (b) underscore-split identifier queries (e.g. first_name) require all sub-tokens to match (no overmatching on just one token).
| * by {@code om_analyzer} must hit some field. Without {@code And}, a query like {@code | ||
| * first_name} matches any column whose name contains just {@code first} or just {@code name}, | ||
| * which both inflates the column index hits and creates the dataAsset/tableColumn count | ||
| * mismatch tracked in github issue #3851. |
There was a problem hiding this comment.
The Javadoc mentions the previous column builder used min_should_match=0, but the underlying multiMatchQuery(..., tieBreaker, fuzziness) helper’s last argument is fuzziness (and no minimum_should_match is set when fuzziness is "0"). Please update the comment to reflect the actual previous behavior so it doesn’t mislead future debugging/tuning.
| * mismatch tracked in github issue #3851. | |
| * mismatch tracked in github issue #3851. The previous builder behavior here was equivalent to | |
| * passing {@code fuzziness="0"} to {@code multiMatchQuery(..., tieBreaker, fuzziness)}, which | |
| * disables fuzziness; this helper invocation does not set {@code minimum_should_match}. |
| private Table createTableWithMultiTokenColumns(TestNamespace ns, String baseName, String tag) { | ||
| String shortId = ns.shortPrefix(); | ||
|
|
||
| org.openmetadata.schema.services.connections.database.PostgresConnection conn = | ||
| DatabaseServices.postgresConnection().hostPort("localhost:5432").username("test").build(); | ||
|
|
||
| DatabaseService dbService = | ||
| DatabaseServices.builder() | ||
| .name("agg_svc_" + shortId + "_" + baseName) | ||
| .connection(conn) | ||
| .description("Test service for dataAsset/tableColumn aggregation parity") | ||
| .create(); | ||
|
|
||
| CreateDatabase dbReq = new CreateDatabase(); | ||
| dbReq.setName("agg_db_" + shortId + "_" + baseName); |
There was a problem hiding this comment.
💡 Quality: createTableWithMultiTokenColumns duplicates boilerplate from other helpers
The new createTableWithMultiTokenColumns method (lines 764-809) repeats ~30 lines of service/database/schema creation that are identical to createTableWithColumns, createTableWithNestedColumns, and createTableWithDeeplyNestedColumns. Per the project guidelines (no duplication, extract shared logic), consider extracting the common infra setup into a shared helper that accepts a List<Column> and returns a Table.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| void testUnconfiguredAssetTypeFallbackMatchesViaDataAsset(TestNamespace ns) throws Exception { | ||
| OpenMetadataClient client = SdkClients.adminClient(); | ||
| String name = ns.prefix("glossary_unconfigured_fallback"); | ||
| CreateGlossary req = new CreateGlossary().withName(name).withDescription(name); | ||
| Glossary glossary = client.glossaries().create(req); | ||
| assertNotNull(glossary); | ||
|
|
||
| Awaitility.await() | ||
| .atMost(90, TimeUnit.SECONDS) | ||
| .pollInterval(500, TimeUnit.MILLISECONDS) | ||
| .until(() -> totalHitsForIndex(client, name, "dataAsset") >= 1); | ||
|
|
||
| long total = totalHitsForIndex(client, name, "dataAsset"); | ||
| assertTrue( | ||
| total >= 1, |
There was a problem hiding this comment.
💡 Quality: Glossary test doesn't clean up created glossary entity
testUnconfiguredAssetTypeFallbackMatchesViaDataAsset creates a glossary via client.glossaries().create(req) but never deletes it. The other helpers use TestNamespace-scoped names that presumably get cleaned up, but this glossary is created ad-hoc. If the test framework doesn't auto-clean glossaries, this is a resource leak across test runs. Verify the cleanup strategy or add an @AfterEach / try-finally deletion.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
…n totals Refactor the `dataAsset`/`all` alias query path so each entity-type bucket in the aggregation matches what its dedicated index returns for the same query. The composite asset config used to merge fields from every type, then apply phrase/ngram-fuzzy semantics to all docs; column docs got semantics different from `buildColumnSearchBuilderV2`, which is why the explore search bar's two calls disagreed on the tableColumn count. The new `buildAllAssetsSearchBuilderV2` builds a per-entity-type bool union: each clause is `filter(entityType=<type>) must(<type's own query>)`. The column branch reuses `buildColumnMultiMatchV2`; every other type goes through `buildBaseQueryV2` with its dedicated config; an extra `should` covers asset types in the `dataAsset` alias that lack a config (e.g. `glossary`, `apiCollection`) using the default config. Also tightens the column builder to `Operator.And` so multi-token queries like `first_name` require every sub-token to match somewhere — fixes the `om_analyzer`-driven over-match where the lenient `Or` + `min_should_match=0` variant matched any column whose name contained just `first` or just `name`. `ColumnSearchIndex.getFields()` gains `name.ngram`, `name.compound`, `displayName.ngram`, `displayName.compound` so prefix queries like `fir` still match column docs from both `index=tableColumn` and the dataAsset bucket. Issue: open-metadata/openmetadata-collate#3851 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds integration tests under ColumnSearchIndexIT that pin the behavior of the fix in PR #27846: - testDataAssetTableColumnAggregationMatchesTableColumnTotal: dataAsset bucket count for tableColumn equals index=tableColumn total for a multi-token query against seeded columns. - testColumnQueryRequiresAllSubtokensToMatch: query "<tag>_first_name" must match the seeded "<tag>_first_name" column but NOT "<tag>_first_id" — pins the Operator.And fix that closed the om_analyzer sub-token over-match. - testDataAssetTableBucketMatchesTableIndexTotal: same parity guarantee for the "table" entity-type bucket, exercising the per-type-union path for a non-column type. - testPrefixQueryMatchesViaNgramOnBothPaths: short prefix queries (e.g. the first few chars of the seeded tag) must match seeded columns via name.ngram and stay in parity across both endpoints. - testUnconfiguredAssetTypeFallbackMatchesViaDataAsset: a Glossary doc (an asset type without an explicit searchSettings.assetTypeConfigurations entry) must still surface via index=dataAsset, exercising the fallback should clause in buildPerTypeUnionQueryV2. Issue: open-metadata/openmetadata-collate#3851 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- testTopicBucketAndResultSetMatchIndexTopic: pins parity for a non-column entity type at both the count and the FQN-set level, exercising the per-type-union path for `topic`. The dataAsset alias must return the same set of topic FQNs that index=topic returns for the same query. - testComplexSyntaxQueriesKeepParity: runs four representative complex- syntax shapes (quoted phrase, AND, OR, mixed parenthesised) through both endpoints and asserts the tableColumn bucket count equals index=tableColumn total. Issue: open-metadata/openmetadata-collate#3851 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| for (String configured : configuredTypes) { | ||
| fallback.mustNot(ElasticQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured)); |
There was a problem hiding this comment.
buildUnconfiguredAssetFallbackV2 currently emits one mustNot term(entityType=...) per configured type. With many asset types this can create a large bool query and add overhead. Use a single mustNot with a terms query over configuredTypes instead (ElasticQueryBuilder supports termsQuery).
| for (String configured : configuredTypes) { | |
| fallback.mustNot(ElasticQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured)); | |
| if (!configuredTypes.isEmpty()) { | |
| fallback.mustNot(ElasticQueryBuilder.termsQuery(ENTITY_TYPE_FIELD, configuredTypes)); |
|
|
||
| String multiTokenQuery = tag + "_first " + tag + "_address"; | ||
|
|
||
| Awaitility.await() | ||
| .atMost(90, TimeUnit.SECONDS) | ||
| .pollInterval(500, TimeUnit.MILLISECONDS) | ||
| .until( | ||
| () -> { | ||
| String r = | ||
| client | ||
| .search() | ||
| .query(multiTokenQuery) | ||
| .index("tableColumn") | ||
| .size(0) | ||
| .deleted(false) | ||
| .execute(); | ||
| JsonNode root = OBJECT_MAPPER.readTree(r); | ||
| long total = root.path("hits").path("total").path("value").asLong(-1); | ||
| return total >= 3; |
There was a problem hiding this comment.
The awaited query tag + "_first " + tag + "_address" is unlikely to ever reach total >= 3 now that the column builder uses a multi_match with Operator.And (each hit must contain all analyzed tokens, so no single column will match both "first" and "*_address"). This can make the Awaitility wait time out and fail the test. Use a query that is expected to match at least one seeded column under the new AND semantics (e.g., target a single column name) and wait for that expected minimum instead of >= 3.
| JsonNode aggBuckets = | ||
| OBJECT_MAPPER | ||
| .readTree(aggResponse) | ||
| .path("aggregations") | ||
| .path("sterms#entityType") | ||
| .path("buckets"); |
There was a problem hiding this comment.
This code assumes the entity-type aggregation is always under aggregations.sterms#entityType. In this repo, other integration tests account for responses that use either entityType or sterms#entityType depending on backend/aggregation builder. To avoid backend-specific failures, locate the aggregation node by checking both keys (or factor this into a shared helper).
| JsonNode aggBuckets = | |
| OBJECT_MAPPER | |
| .readTree(aggResponse) | |
| .path("aggregations") | |
| .path("sterms#entityType") | |
| .path("buckets"); | |
| JsonNode aggregations = OBJECT_MAPPER.readTree(aggResponse).path("aggregations"); | |
| JsonNode entityTypeAggregation = aggregations.path("entityType"); | |
| if (entityTypeAggregation.isMissingNode()) { | |
| entityTypeAggregation = aggregations.path("sterms#entityType"); | |
| } | |
| JsonNode aggBuckets = entityTypeAggregation.path("buckets"); |
| Awaitility.await() | ||
| .atMost(90, TimeUnit.SECONDS) | ||
| .pollInterval(500, TimeUnit.MILLISECONDS) | ||
| .until(() -> totalHitsForIndex(client, prefixQuery, "tableColumn") >= 5); |
There was a problem hiding this comment.
The Awaitility condition totalHitsForIndex(...) >= 5 looks too high for the data created in this test (the helper creates 3 columns). If the random tag is unique (likely), the prefix query may only ever match those seeded columns, causing an unnecessary timeout/flaky failure. Consider waiting for the expected minimum (e.g., >= 3) or simply > 0.
| .until(() -> totalHitsForIndex(client, prefixQuery, "tableColumn") >= 5); | |
| .until(() -> totalHitsForIndex(client, prefixQuery, "tableColumn") >= 3); |
| long aggTopicBucket = bucketCountFromDataAsset(client, tag, Entity.TOPIC); | ||
| assertEquals( | ||
| topicTotal, | ||
| aggTopicBucket, | ||
| "dataAsset topic bucket must equal index=topic total for query " + tag); | ||
|
|
There was a problem hiding this comment.
Helper bucketCountFromDataAsset hard-codes aggregations.sterms#entityType. Elsewhere (e.g., SearchResourceIT) the code handles both entityType and sterms#entityType. To prevent backend-dependent failures, update this helper to select whichever aggregation key is present.
| for (String configured : configuredTypes) { | ||
| fallback.mustNot(OpenSearchQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured)); |
There was a problem hiding this comment.
buildUnconfiguredAssetFallbackV2 adds one mustNot term(entityType=...) clause per configured type. With many configured asset types this can bloat the query and slow execution. Prefer a single mustNot with a terms query over the configuredTypes set (the query builder already supports termsQuery).
| for (String configured : configuredTypes) { | |
| fallback.mustNot(OpenSearchQueryBuilder.termQuery(ENTITY_TYPE_FIELD, configured)); | |
| if (!configuredTypes.isEmpty()) { | |
| fallback.mustNot( | |
| OpenSearchQueryBuilder.termsQuery( | |
| ENTITY_TYPE_FIELD, | |
| configuredTypes.stream().map(FieldValue::of).toList())); |
…n totals Refactor the `dataAsset`/`all` alias query path so each entity-type bucket in the aggregation matches what its dedicated index returns for the same query. The composite asset config used to merge fields from every type, then apply phrase/ngram-fuzzy semantics to all docs; column docs got semantics different from `buildColumnSearchBuilderV2`, which is why the explore search bar's two calls disagreed on the tableColumn count. The new `buildAllAssetsSearchBuilderV2` builds a per-entity-type bool union: each clause is `filter(entityType=<type>) must(<type's own query>)`. The column branch reuses `buildColumnMultiMatchV2`; every other type goes through `buildBaseQueryV2` with its dedicated config; an extra `should` covers asset types in the `dataAsset` alias that lack a config (e.g. `glossary`, `apiCollection`) using the default config. Also tightens the column builder to `Operator.And` so multi-token queries like `first_name` require every sub-token to match somewhere — fixes the `om_analyzer`-driven over-match where the lenient `Or` + `min_should_match=0` variant matched any column whose name contained just `first` or just `name`. `ColumnSearchIndex.getFields()` gains `name.ngram`, `name.compound`, `displayName.ngram`, `displayName.compound` so prefix queries like `fir` still match column docs from both `index=tableColumn` and the dataAsset bucket. Adds integration tests in ColumnSearchIndexIT covering: - multi-token query parity for the tableColumn bucket vs index=tableColumn - sub-token over-match guard (query `<tag>_first_name` must not match a seeded `<tag>_first_id` column) - table bucket parity (count) and topic parity (count + FQN-set match) - prefix queries via `name.ngram` keeping parity across both endpoints - complex-syntax queries (quoted phrase, AND/OR, mixed parenthesised) keeping the tableColumn bucket parity - unconfigured asset type fallback (Glossary docs reachable via dataAsset) Issue: open-metadata/openmetadata-collate#3851 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| JsonNode buckets = | ||
| OBJECT_MAPPER | ||
| .readTree(response) | ||
| .path("aggregations") | ||
| .path("sterms#entityType") | ||
| .path("buckets"); |
There was a problem hiding this comment.
bucketCountFromDataAsset assumes the entity-type aggregation lives under aggregations.sterms#entityType, but other integration tests in this repo already need to support aggregations.entityType as well. This helper should check for both keys; otherwise it will silently return 0 and break the parity assertions depending on search backend/serialization.
| JsonNode buckets = | |
| OBJECT_MAPPER | |
| .readTree(response) | |
| .path("aggregations") | |
| .path("sterms#entityType") | |
| .path("buckets"); | |
| JsonNode root = OBJECT_MAPPER.readTree(response); | |
| JsonNode aggregations = root.path("aggregations"); | |
| JsonNode buckets = aggregations.path("sterms#entityType").path("buckets"); | |
| if (buckets.isMissingNode() || !buckets.isArray()) { | |
| buckets = aggregations.path("entityType").path("buckets"); | |
| } |
| String multiTokenQuery = tag + "_first " + tag + "_address"; | ||
|
|
||
| Awaitility.await() | ||
| .atMost(90, TimeUnit.SECONDS) | ||
| .pollInterval(500, TimeUnit.MILLISECONDS) | ||
| .until( | ||
| () -> { | ||
| String r = | ||
| client | ||
| .search() | ||
| .query(multiTokenQuery) | ||
| .index("tableColumn") | ||
| .size(0) | ||
| .deleted(false) | ||
| .execute(); | ||
| JsonNode root = OBJECT_MAPPER.readTree(r); | ||
| long total = root.path("hits").path("total").path("value").asLong(-1); | ||
| return total >= 3; | ||
| }); |
There was a problem hiding this comment.
multiTokenQuery is constructed as two different column-name prefixes separated by a space (<tag>_first <tag>_address). With the updated column multi-match using Operator.And, this query is unlikely to match any single column document (no column contains both “first” and “address”), which would cause the Awaitility block to time out and make the test fail. Use a query that is expected to match at least one seeded column doc (e.g., the actual <tag>_first_name term or another query where all required tokens are present in the same column).
| * into {@code [first, name]}; with the old {@code Operator.Or} + {@code min_should_match=0} | ||
| * column builder, a column called {@code <tag>_first_id} matched a query of {@code | ||
| * <tag>_first_name} because the single token {@code first} was enough. The fix moves the | ||
| * column builder to {@code Operator.And}, so every sub-token must match. |
There was a problem hiding this comment.
The comment says the old column builder used Operator.Or + min_should_match=0, but the implementation was passing the string "0" as the fuzziness parameter (see ElasticQueryBuilder/OpenSearchQueryBuilder.multiMatchQuery(..., fuzziness)). minimum_should_match was not set at all when fuzziness was "0". Please update the comment to reflect the actual behavior so future debugging isn’t misled.
| * into {@code [first, name]}; with the old {@code Operator.Or} + {@code min_should_match=0} | |
| * column builder, a column called {@code <tag>_first_id} matched a query of {@code | |
| * <tag>_first_name} because the single token {@code first} was enough. The fix moves the | |
| * column builder to {@code Operator.And}, so every sub-token must match. | |
| * into {@code [first, name]}; with the old {@code Operator.Or} column builder, the string | |
| * {@code "0"} was passed as the fuzziness value rather than setting {@code | |
| * minimum_should_match}, so a column called {@code <tag>_first_id} matched a query of | |
| * {@code <tag>_first_name} because the single token {@code first} was enough. The fix moves | |
| * the column builder to {@code Operator.And}, so every sub-token must match. |
| Awaitility.await() | ||
| .atMost(90, TimeUnit.SECONDS) | ||
| .pollInterval(500, TimeUnit.MILLISECONDS) | ||
| .until(() -> totalHitsForIndex(client, tag, "tableColumn") >= 5); |
There was a problem hiding this comment.
Similar to the prefix test, this Awaitility condition requires totalHitsForIndex(tag, "tableColumn") >= 5 even though the table fixture seeds 3 columns. In a properly isolated namespace this is likely to never reach 5 and will make the test flaky/fail. Lower the threshold to what the fixture guarantees (e.g., >= 1 or >= 3) or wait for a specific known seeded column to appear.
| .until(() -> totalHitsForIndex(client, tag, "tableColumn") >= 5); | |
| .until(() -> totalHitsForIndex(client, tag, "tableColumn") >= 3); |
Bug fixes the reviewer flagged: - buildPerTypeUnionQueryV2 (both factories) now guards against searchSettings.getAssetTypeConfigurations() returning null/empty by falling back to the default config so dataAsset/all queries can't NPE if the search settings haven't been initialized. - testDataAssetTableColumnAggregationMatchesTableColumnTotal previously awaited a multi-token query that no single seeded column could satisfy under the new Operator.And semantics, leaving the parity assertion trivially 0 == 0; switched to a query that actually matches multiple seeded columns. - testColumnQueryRequiresAllSubtokensToMatch was relying on a column name (`<tag>_first_id`) that was never created, so the negative assertion was trivially true. createTableWithMultiTokenColumns now seeds first_id (and alpha/bravo columns used by the complex-syntax and prefix tests) so all the parity assertions exercise real indexed documents. - bucketCountFromDataAsset accepts both `entityType` and `sterms#entityType` aggregation key shapes so the helper doesn't break on backends that label the bucket differently. - Javadoc on buildColumnMultiMatchV2 (both factories) corrected to describe the actual previous shape (`Operator.Or` + `fuzziness="0"`, which left minimum_should_match unset) instead of the inaccurate `min_should_match=0`. Issue: open-metadata/openmetadata-collate#3851 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| return SdkClients.adminClient().tables().create(tableRequest); | ||
| } | ||
|
|
||
| private static final int MULTI_TOKEN_SEED_COUNT = 7; |
There was a problem hiding this comment.
💡 Quality: MULTI_TOKEN_SEED_COUNT constant is defined but never used
The constant MULTI_TOKEN_SEED_COUNT = 7 was added at line 916 but is never referenced anywhere in the file. The magic numbers 5 in the await conditions (e.g., >= 5 at lines 505 and 605) and 2 at line 396 appear to be the intended use sites but still use raw literals. Either use the constant or remove it.
Suggested fix:
Either replace the magic numbers with the constant:
.until(() -> totalHitsForIndex(client, tag, "tableColumn") >= MULTI_TOKEN_SEED_COUNT - 2);
or remove the unused constant if it was only added for documentation purposes.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
Adds a `termsQuery(String, Collection<String>)` helper to both OpenSearchQueryBuilder and ElasticQueryBuilder, then uses it in buildUnconfiguredAssetFallbackV2 to replace the per-type `mustNot term(entityType=...)` chain with a single `mustNot terms(entityType=[...])`. Keeps the bool query small regardless of how many configured asset types exist — the previous shape grew one clause per configured type. Issue: open-metadata/openmetadata-collate#3851 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review 👍 Approved with suggestions 4 resolved / 7 findingsAligns dataAsset aggregation counts with index totals while resolving several test and null-pointer edge cases. Please refactor the redundant boilerplate in 💡 Quality: createTableWithMultiTokenColumns duplicates boilerplate from other helpers📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:764-778 📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:759-762 The new 💡 Quality: Glossary test doesn't clean up created glossary entity
💡 Quality: MULTI_TOKEN_SEED_COUNT constant is defined but never used📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ColumnSearchIndexIT.java:916 The constant Suggested fix✅ 4 resolved✅ Bug: Sub-token over-match test is trivially true (missing first_id column)
✅ Bug: NPE if getAssetTypeConfigurations() returns null
✅ Bug: Complex-syntax test waits for ≥5 columns but only 3 are created
✅ Edge Case: Complex-syntax quoted-phrase query may match zero hits trivially
🤖 Prompt for agentsOptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|
🔴 Playwright Results — 156 failure(s), 16 flaky✅ 1494 passed · ❌ 156 failed · 🟡 16 flaky · ⏭️ 152 skipped
Genuine Failures (failed on all attempts)❌
|



Summary
dataAsset/allalias through a newbuildAllAssetsSearchBuilderV2, which builds a per-entity-type bool union — each clause isfilter(entityType=<type>) must(<type's own query>). ThetableColumnbranch reuses the column builder; every other type goes through its dedicated asset config; a fallbackshouldcovers types in thedataAssetalias without a config (e.g.glossary,apiCollection). Each entity-type bucket count therefore equals what the dedicated index returns for the same query, by construction.buildColumnSearchBuilderV2toOperator.Andso multi-token queries likefirst_namerequire every analyzer sub-token to match somewhere (the previousOr+min_should_match=0over-matched on any single sub-token).name.ngram,name.compound,displayName.ngram,displayName.compoundtoColumnSearchIndex.getFields()so prefix queries (e.g.fir) still match column docs from bothindex=tableColumnand the dataAsset bucket.Issue: open-metadata/openmetadata-collate#3851
Reproduction / verification
scripts/reproduce_column_agg_mismatch.shis a multi-query probe that exits non-zero on any divergence between the dataAsset aggregation bucket and theindex=tableColumntotal. After this PR, all probed queries return matching counts:Tests covering bucket-parity for
tableColumn/tableand the sub-token over-match guard are coming in a follow-up commit.Test plan
./scripts/reproduce_column_agg_mismatch.sh— all probed queries returnOK.first_name, confirm the tableColumn tab badge equals the entity-type aggregation panel count.fir— column results appear, and the bucket count matches the tableColumn tab.index=table,index=topic, etc.) still behave as before.🤖 Generated with Claude Code
Summary by Gitar
buildUnconfiguredAssetFallbackV2in bothElasticSearchSourceBuilderFactoryandOpenSearchSourceBuilderFactoryby replacing iterativemustNotterm queries with a single, more efficienttermsQuery.termsQuerytoElasticQueryBuilderandOpenSearchQueryBuilderto support batch filtering of entity types.This will update automatically on new commits.