Skip to content
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,11 @@

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.awaitility.Awaitility;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Nested;
import org.junit.jupiter.api.Test;
Expand All @@ -33,15 +36,18 @@
import org.openmetadata.it.util.TestNamespaceExtension;
import org.openmetadata.schema.api.data.CreateDatabase;
import org.openmetadata.schema.api.data.CreateDatabaseSchema;
import org.openmetadata.schema.api.data.CreateGlossary;
import org.openmetadata.schema.api.data.CreateTable;
import org.openmetadata.schema.entity.data.Database;
import org.openmetadata.schema.entity.data.DatabaseSchema;
import org.openmetadata.schema.entity.data.Glossary;
import org.openmetadata.schema.entity.data.Table;
import org.openmetadata.schema.entity.services.DatabaseService;
import org.openmetadata.schema.type.Column;
import org.openmetadata.schema.type.ColumnDataType;
import org.openmetadata.sdk.client.OpenMetadataClient;
import org.openmetadata.sdk.fluent.DatabaseServices;
import org.openmetadata.service.Entity;

/**
* Integration tests for column search indexing during table reindexing. Verifies:
Expand Down Expand Up @@ -353,6 +359,268 @@ void testColumnFilterByDatabase(TestNamespace ns) throws Exception {
}
}

@Nested
@DisplayName("DataAsset/TableColumn Aggregation Parity Tests")
@Execution(ExecutionMode.CONCURRENT)
class DataAssetColumnAggregationTests {

/**
* Regression for github.com/open-metadata/openmetadata-collate/issues/3851. The UI fires two
* requests when the user types a multi-word query: {@code index=dataAsset&size=0} for the
* entity-type aggregation and {@code index=tableColumn} for the column hits. They route through
* different builders in {@code OpenSearchSourceBuilderFactory}: {@code tableColumn} uses the
* dedicated lenient column builder while {@code dataAsset} uses the composite asset config with
* stricter phrase/{@code 2<70%} matching. Without the fix, the {@code tableColumn} bucket in the
* {@code dataAsset} aggregation under-counts the column index hits.
*/
@Test
@DisplayName(
"Aggregation bucket for tableColumn under dataAsset must match index=tableColumn total")
void testDataAssetTableColumnAggregationMatchesTableColumnTotal(TestNamespace ns)
throws Exception {
OpenMetadataClient client = SdkClients.adminClient();

String tag = ns.shortPrefix();
Table table = createTableWithMultiTokenColumns(ns, "agg_parity_" + tag, tag);
assertNotNull(table);

String multiTokenQuery = tag + "_first " + tag + "_address";

Awaitility.await()
.atMost(90, TimeUnit.SECONDS)
.pollInterval(500, TimeUnit.MILLISECONDS)
.until(
() -> {
String r =
client
.search()
.query(multiTokenQuery)
.index("tableColumn")
.size(0)
.deleted(false)
.execute();
JsonNode root = OBJECT_MAPPER.readTree(r);
long total = root.path("hits").path("total").path("value").asLong(-1);
return total >= 3;
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The awaited query tag + "_first " + tag + "_address" is unlikely to ever reach total >= 3 now that the column builder uses a multi_match with Operator.And (each hit must contain all analyzed tokens, so no single column will match both "first" and "*_address"). This can make the Awaitility wait time out and fail the test. Use a query that is expected to match at least one seeded column under the new AND semantics (e.g., target a single column name) and wait for that expected minimum instead of >= 3.

Copilot uses AI. Check for mistakes.
});
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiTokenQuery is constructed as two different column-name prefixes separated by a space (<tag>_first <tag>_address). With the updated column multi-match using Operator.And, this query is unlikely to match any single column document (no column contains both “first” and “address”), which would cause the Awaitility block to time out and make the test fail. Use a query that is expected to match at least one seeded column doc (e.g., the actual <tag>_first_name term or another query where all required tokens are present in the same column).

Copilot uses AI. Check for mistakes.

String columnResponse =
client
.search()
.query(multiTokenQuery)
.index("tableColumn")
.size(0)
.deleted(false)
.execute();
long columnTotal =
OBJECT_MAPPER.readTree(columnResponse).path("hits").path("total").path("value").asLong();

String aggResponse =
client
.search()
.query(multiTokenQuery)
.index("dataAsset")
.size(0)
.deleted(false)
.execute();
JsonNode aggBuckets =
OBJECT_MAPPER
.readTree(aggResponse)
.path("aggregations")
.path("sterms#entityType")
.path("buckets");
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code assumes the entity-type aggregation is always under aggregations.sterms#entityType. In this repo, other integration tests account for responses that use either entityType or sterms#entityType depending on backend/aggregation builder. To avoid backend-specific failures, locate the aggregation node by checking both keys (or factor this into a shared helper).

Suggested change
JsonNode aggBuckets =
OBJECT_MAPPER
.readTree(aggResponse)
.path("aggregations")
.path("sterms#entityType")
.path("buckets");
JsonNode aggregations = OBJECT_MAPPER.readTree(aggResponse).path("aggregations");
JsonNode entityTypeAggregation = aggregations.path("entityType");
if (entityTypeAggregation.isMissingNode()) {
entityTypeAggregation = aggregations.path("sterms#entityType");
}
JsonNode aggBuckets = entityTypeAggregation.path("buckets");

Copilot uses AI. Check for mistakes.
long aggColumnCount = 0;
for (JsonNode bucket : aggBuckets) {
if ("tableColumn".equals(bucket.path("key").asText())) {
aggColumnCount = bucket.path("doc_count").asLong();
break;
}
}

assertEquals(
columnTotal,
aggColumnCount,
"dataAsset aggregation tableColumn bucket ("
+ aggColumnCount
+ ") must match index=tableColumn total ("
+ columnTotal
+ ") for query \""
+ multiTokenQuery
+ "\"");
}

/**
* Guards against the {@code _} over-match: the {@code om_analyzer} splits {@code first_name}
* into {@code [first, name]}; with the old {@code Operator.Or} + {@code min_should_match=0}
* column builder, a column called {@code <tag>_first_id} matched a query of {@code
* <tag>_first_name} because the single token {@code first} was enough. The fix moves the
* column builder to {@code Operator.And}, so every sub-token must match.
Comment on lines +418 to +421
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the old column builder used Operator.Or + min_should_match=0, but the implementation was passing the string "0" as the fuzziness parameter (see ElasticQueryBuilder/OpenSearchQueryBuilder.multiMatchQuery(..., fuzziness)). minimum_should_match was not set at all when fuzziness was "0". Please update the comment to reflect the actual behavior so future debugging isn’t misled.

Suggested change
* into {@code [first, name]}; with the old {@code Operator.Or} + {@code min_should_match=0}
* column builder, a column called {@code <tag>_first_id} matched a query of {@code
* <tag>_first_name} because the single token {@code first} was enough. The fix moves the
* column builder to {@code Operator.And}, so every sub-token must match.
* into {@code [first, name]}; with the old {@code Operator.Or} column builder, the string
* {@code "0"} was passed as the fuzziness value rather than setting {@code
* minimum_should_match}, so a column called {@code <tag>_first_id} matched a query of
* {@code <tag>_first_name} because the single token {@code first} was enough. The fix moves
* the column builder to {@code Operator.And}, so every sub-token must match.

Copilot uses AI. Check for mistakes.
*/
@Test
@DisplayName("Column query for first_name must not match a column called first_id")
void testColumnQueryRequiresAllSubtokensToMatch(TestNamespace ns) throws Exception {
OpenMetadataClient client = SdkClients.adminClient();
String tag = ns.shortPrefix();
Table table = createTableWithMultiTokenColumns(ns, "subtoken_" + tag, tag);
assertNotNull(table);

String firstNameColumn = tag + "_first_name";
String firstIdColumn = tag + "_first_id";

Awaitility.await()
.atMost(90, TimeUnit.SECONDS)
.pollInterval(500, TimeUnit.MILLISECONDS)
.until(() -> hitsForColumnQuery(client, firstNameColumn).contains(firstNameColumn));

Set<String> hits = hitsForColumnQuery(client, firstNameColumn);
Comment thread
gitar-bot[bot] marked this conversation as resolved.
assertTrue(
hits.contains(firstNameColumn),
"Query " + firstNameColumn + " must match the column with the same name; got " + hits);
assertFalse(
hits.contains(firstIdColumn),
"Query "
+ firstNameColumn
+ " must not match column "
+ firstIdColumn
+ " (only the 'first' sub-token overlaps); got "
+ hits);
}

/**
* Same parity guarantee that {@link #testDataAssetTableColumnAggregationMatchesTableColumnTotal}
* pins for {@code tableColumn}, but for the {@code table} bucket. The per-type-union path must
* make every entity-type bucket equal to what the dedicated index returns.
*/
@Test
@DisplayName("Aggregation bucket for table under dataAsset must match index=table total")
void testDataAssetTableBucketMatchesTableIndexTotal(TestNamespace ns) throws Exception {
OpenMetadataClient client = SdkClients.adminClient();
String tag = ns.shortPrefix();
String tableQueryTag = "tblparity" + tag;
Table table = createTableWithMultiTokenColumns(ns, tableQueryTag, tag);
assertNotNull(table);

Awaitility.await()
.atMost(90, TimeUnit.SECONDS)
.pollInterval(500, TimeUnit.MILLISECONDS)
.until(() -> totalHitsForIndex(client, table.getName(), "table") >= 1);

long tableTotal = totalHitsForIndex(client, table.getName(), "table");
long aggTableBucket = bucketCountFromDataAsset(client, table.getName(), "table");

assertEquals(
tableTotal,
aggTableBucket,
"dataAsset aggregation table bucket must equal index=table total for query "
+ table.getName());
}

/**
* Pins prefix-style matching so it stays aligned across the two paths after the per-type-union
* refactor. Production behavior for short queries like {@code fir} relies on {@code name.ngram}
* and {@code name.compound}, which are now part of {@link
* org.openmetadata.service.search.indexes.ColumnSearchIndex#getFields()}; the dataAsset bucket
* for {@code tableColumn} must produce the same total as {@code index=tableColumn} for the
* same prefix query, and the seeded column must be in both result sets.
*/
@Test
@DisplayName("Prefix queries match via ngram and stay in parity across both paths")
void testPrefixQueryMatchesViaNgramOnBothPaths(TestNamespace ns) throws Exception {
OpenMetadataClient client = SdkClients.adminClient();
String tag = ns.shortPrefix();
Table table = createTableWithMultiTokenColumns(ns, "ngram_" + tag, tag);
assertNotNull(table);

String prefixQuery = tag.substring(0, Math.min(4, tag.length()));

Awaitility.await()
.atMost(90, TimeUnit.SECONDS)
.pollInterval(500, TimeUnit.MILLISECONDS)
.until(() -> totalHitsForIndex(client, prefixQuery, "tableColumn") >= 5);
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Awaitility condition totalHitsForIndex(...) >= 5 looks too high for the data created in this test (the helper creates 3 columns). If the random tag is unique (likely), the prefix query may only ever match those seeded columns, causing an unnecessary timeout/flaky failure. Consider waiting for the expected minimum (e.g., >= 3) or simply > 0.

Suggested change
.until(() -> totalHitsForIndex(client, prefixQuery, "tableColumn") >= 5);
.until(() -> totalHitsForIndex(client, prefixQuery, "tableColumn") >= 3);

Copilot uses AI. Check for mistakes.

long columnTotal = totalHitsForIndex(client, prefixQuery, "tableColumn");
long aggColumnBucket = bucketCountFromDataAsset(client, prefixQuery, Entity.TABLE_COLUMN);

assertTrue(
columnTotal > 0,
"Prefix query " + prefixQuery + " should match seeded columns via name.ngram");
assertEquals(
columnTotal,
aggColumnBucket,
"dataAsset tableColumn bucket ("
+ aggColumnBucket
+ ") must match index=tableColumn total ("
+ columnTotal
+ ") for prefix query "
+ prefixQuery);
}

/**
* Asset types that live in the {@code dataAsset} alias but lack a {@code
* searchSettings.assetTypeConfigurations} entry (e.g. {@code glossary}) used to be matched via
* the composite-config field merge. The per-type-union path adds a fallback {@code should}
* clause for these; this test pins that a glossary doc still appears in {@code
* index=dataAsset} hits when its name matches the query.
*/
@Test
@DisplayName("Asset types without dedicated config (glossary) still match via dataAsset alias")
void testUnconfiguredAssetTypeFallbackMatchesViaDataAsset(TestNamespace ns) throws Exception {
OpenMetadataClient client = SdkClients.adminClient();
String name = ns.prefix("glossary_unconfigured_fallback");
CreateGlossary req = new CreateGlossary().withName(name).withDescription(name);
Glossary glossary = client.glossaries().create(req);
assertNotNull(glossary);

Awaitility.await()
.atMost(90, TimeUnit.SECONDS)
.pollInterval(500, TimeUnit.MILLISECONDS)
.until(() -> totalHitsForIndex(client, name, "dataAsset") >= 1);

long total = totalHitsForIndex(client, name, "dataAsset");
assertTrue(
total >= 1,
Comment on lines +531 to +545
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Glossary test doesn't clean up created glossary entity

testUnconfiguredAssetTypeFallbackMatchesViaDataAsset creates a glossary via client.glossaries().create(req) but never deletes it. The other helpers use TestNamespace-scoped names that presumably get cleaned up, but this glossary is created ad-hoc. If the test framework doesn't auto-clean glossaries, this is a resource leak across test runs. Verify the cleanup strategy or add an @AfterEach / try-finally deletion.

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

"Glossary doc with name "
+ name
+ " must be reachable via index=dataAsset (fallback clause for unconfigured types)");
}

private Set<String> hitsForColumnQuery(OpenMetadataClient client, String query)
throws Exception {
String response =
client.search().query(query).index("tableColumn").size(50).deleted(false).execute();
JsonNode hits = OBJECT_MAPPER.readTree(response).path("hits").path("hits");
Set<String> names = new HashSet<>();
for (JsonNode hit : hits) {
names.add(hit.path("_source").path("name").asText());
}
return names;
}

private long totalHitsForIndex(OpenMetadataClient client, String query, String index)
throws Exception {
String response = client.search().query(query).index(index).size(0).deleted(false).execute();
return OBJECT_MAPPER.readTree(response).path("hits").path("total").path("value").asLong();
}

private long bucketCountFromDataAsset(
OpenMetadataClient client, String query, String entityType) throws Exception {
String response =
client.search().query(query).index("dataAsset").size(0).deleted(false).execute();
JsonNode buckets =
OBJECT_MAPPER
.readTree(response)
.path("aggregations")
.path("sterms#entityType")
.path("buckets");
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bucketCountFromDataAsset assumes the entity-type aggregation lives under aggregations.sterms#entityType, but other integration tests in this repo already need to support aggregations.entityType as well. This helper should check for both keys; otherwise it will silently return 0 and break the parity assertions depending on search backend/serialization.

Suggested change
JsonNode buckets =
OBJECT_MAPPER
.readTree(response)
.path("aggregations")
.path("sterms#entityType")
.path("buckets");
JsonNode root = OBJECT_MAPPER.readTree(response);
JsonNode aggregations = root.path("aggregations");
JsonNode buckets = aggregations.path("sterms#entityType").path("buckets");
if (buckets.isMissingNode() || !buckets.isArray()) {
buckets = aggregations.path("entityType").path("buckets");
}

Copilot uses AI. Check for mistakes.
for (JsonNode bucket : buckets) {
if (entityType.equals(bucket.path("key").asText())) {
return bucket.path("doc_count").asLong();
}
}
return 0L;
}
}

@Nested
@DisplayName("Nested Column Tests")
@Execution(ExecutionMode.CONCURRENT)
Expand Down Expand Up @@ -493,6 +761,53 @@ private Table createTableWithColumns(TestNamespace ns, String baseName) {
return SdkClients.adminClient().tables().create(tableRequest);
}

private Table createTableWithMultiTokenColumns(TestNamespace ns, String baseName, String tag) {
String shortId = ns.shortPrefix();

org.openmetadata.schema.services.connections.database.PostgresConnection conn =
DatabaseServices.postgresConnection().hostPort("localhost:5432").username("test").build();

DatabaseService dbService =
DatabaseServices.builder()
.name("agg_svc_" + shortId + "_" + baseName)
.connection(conn)
.description("Test service for dataAsset/tableColumn aggregation parity")
.create();

CreateDatabase dbReq = new CreateDatabase();
dbReq.setName("agg_db_" + shortId + "_" + baseName);
Comment on lines +852 to +866
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: createTableWithMultiTokenColumns duplicates boilerplate from other helpers

The new createTableWithMultiTokenColumns method (lines 764-809) repeats ~30 lines of service/database/schema creation that are identical to createTableWithColumns, createTableWithNestedColumns, and createTableWithDeeplyNestedColumns. Per the project guidelines (no duplication, extract shared logic), consider extracting the common infra setup into a shared helper that accepts a List<Column> and returns a Table.

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

dbReq.setService(dbService.getFullyQualifiedName());
Database database = SdkClients.adminClient().databases().create(dbReq);

CreateDatabaseSchema schemaReq = new CreateDatabaseSchema();
schemaReq.setName("agg_schema_" + shortId + "_" + baseName);
schemaReq.setDatabase(database.getFullyQualifiedName());
DatabaseSchema schema = SdkClients.adminClient().databaseSchemas().create(schemaReq);

CreateTable tableRequest = new CreateTable();
tableRequest.setName(ns.prefix(baseName));
tableRequest.setDatabaseSchema(schema.getFullyQualifiedName());
tableRequest.setColumns(
List.of(
new Column()
.withName(tag + "_first_name")
.withDataType(ColumnDataType.VARCHAR)
.withDataLength(255)
.withDescription("First name"),
new Column()
.withName(tag + "_last_name")
.withDataType(ColumnDataType.VARCHAR)
.withDataLength(255)
.withDescription("Last name"),
new Column()
.withName(tag + "_address")
.withDataType(ColumnDataType.VARCHAR)
.withDataLength(512)
.withDescription("Postal address")));

return SdkClients.adminClient().tables().create(tableRequest);
}

private Table createTableWithNestedColumns(TestNamespace ns, String baseName) {
String shortId = ns.shortPrefix();

Expand Down
Loading
Loading