fix(profiler): N+1 / missing-index regression on /tables/.../columns?fields=profile (#3488)#27746
sonika-shah wants to merge 6 commits into `main`.
Conversation
…fields=profile (#3488)

Root cause
----------

The 1.9.9 migration introduced two separate index regressions on `profiler_data_time_series`:

1. **PostgreSQL**: `schemaChanges.sql` explicitly dropped the unique constraint `profiler_data_time_series_unique_hash_extension_ts` (entityFQNHash, extension, operation, timestamp) to allow altering the generated `operation` column expression, but never recreated it. After the migration the table kept only the `(extension, timestamp)` index, which is useless for queries filtering by `entityFQNHash`.
2. **MySQL/both**: `postDataMigrationSQLScript.sql` created temporary indexes (idx_pdts_entityFQNHash, idx_pdts_composite, etc.) for its bulk UPDATE pass and then dropped **all** of them, including the only index covering `entityFQNHash`.

The batch query issued by `getLatestExtensionsBatch()` when `fields=profile` is requested:

```sql
SELECT entityFQNHash, MAX(timestamp)
FROM profiler_data_time_series
WHERE entityFQNHash IN (...N hashes...)
  AND extension = 'table.columnProfile'
GROUP BY entityFQNHash
```

required an `(entityFQNHash, extension, timestamp)` index. Without it the database performs a full table scan. On production deployments with millions of profiler rows this caused 100+ second response times (Grafana: 106 770 ms; 99 % in DB; 93 dbOps). Without `profile` in the fields param the same endpoint returned in ~150-220 ms.

A secondary N+1 bug existed independently of the index: `customMetrics` in fields called `getCustomMetrics(table, column)` once per paginated column, issuing up to N identical queries against `entity_extension` and then filtering in Java.

Fix
---

* **migration 2.0.2** (MySQL + PostgreSQL): `CREATE INDEX IF NOT EXISTS idx_pdts_fqnhash_ext_ts ON profiler_data_time_series(entityFQNHash, extension, timestamp)`. The `IF NOT EXISTS` guard makes the migration safe to re-run and handles both upgrade and fresh-install paths.
* **`getTableColumnsInternal`** — `customMetrics` block: fetch all column custom metrics for the table in one query, group by column name in Java, then distribute. Reduces N queries to 1.
* **`getTableColumnsInternal`** — `profile` block: skip the duplicate `populateEntityFieldTags` call when `tags` was already fetched earlier in the same request, saving one prefix-scan on `tag_usage` per request.

Related: PR #26855 (fixed N+1 tag queries on the list-tables path but left the profiler-index and customMetrics N+1 untouched on the columns sub-path).
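The batch-then-group fix for `customMetrics` can be sketched as follows. This is a minimal, self-contained illustration: the extension-key format (`table.customMetrics.<columnName>`), the `ExtensionRow` type, and the method names here are assumptions for the sketch; the real code goes through the JDBI `entityExtensionDAO` in `TableRepository`.

```java
import java.util.*;
import java.util.stream.*;

// Sketch: instead of one entity_extension query per column, fetch all
// custom-metric extension rows for the table in a single query, then
// distribute them to their columns in memory.
public class BatchCustomMetrics {
  // Stand-in for a row returned by a single bulk DAO call:
  // an extension key plus its JSON payload. (Hypothetical shape.)
  record ExtensionRow(String extensionKey, String json) {}

  static Map<String, List<String>> groupByColumn(List<ExtensionRow> rows, String prefix) {
    return rows.stream()
        .filter(r -> r.extensionKey().startsWith(prefix))
        .collect(Collectors.groupingBy(
            r -> r.extensionKey().substring(prefix.length()), // column-name suffix
            Collectors.mapping(ExtensionRow::json, Collectors.toList())));
  }

  public static void main(String[] args) {
    String prefix = "table.customMetrics.";
    List<ExtensionRow> rows = List.of(
        new ExtensionRow("table.customMetrics.id", "{\"metric\":\"nullRatio\"}"),
        new ExtensionRow("table.customMetrics.name", "{\"metric\":\"distinctCount\"}"),
        new ExtensionRow("table.customMetrics.id", "{\"metric\":\"max\"}"));
    Map<String, List<String>> byColumn = groupByColumn(rows, prefix);
    System.out.println(byColumn.get("id").size());  // metrics grouped under column "id"
    System.out.println(byColumn.keySet());          // distinct column names seen
  }
}
```

Whatever the exact key format, the point is the shape: one round trip to the database, O(N) in-memory grouping, instead of N round trips.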
Pull request overview
Addresses a performance regression on the /tables/.../columns?fields=profile path by restoring an efficient composite index for profiler lookups and removing N+1 behavior when loading column customMetrics.
Changes:
- Adds a composite index on `profiler_data_time_series(entityFQNHash, extension, timestamp)` for PostgreSQL and MySQL migrations.
- Updates `TableRepository#getTableColumnsInternal` to fetch all column custom-metrics in one query and avoid duplicate tag population when `tags` is already requested.
- Adds an integration regression test covering `fields=profile` and combinations with `tags`/`customMetrics`/`extension`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TableRepository.java | Batches column custom-metrics retrieval and avoids duplicate tag population when profile is requested alongside tags. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/TableResourceIT.java | Adds regression coverage for columns API with fields=profile and related combinations. |
| bootstrap/sql/migrations/native/2.0.2/postgres/schemaChanges.sql | Adds composite index to prevent profiler full-table scans on PostgreSQL. |
| bootstrap/sql/migrations/native/2.0.2/mysql/schemaChanges.sql | Adds composite index intended to prevent profiler full-table scans on MySQL. |
🟡 Playwright Results — all passed (13 flaky): ✅ 3999 passed · ❌ 0 failed · 🟡 13 flaky · ⏭️ 86 skipped

🟡 13 flaky test(s) (passed on retry)

How to debug locally:

```shell
# Download the playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip  # view trace
```
… + batch column extension/customMetrics fetch

Move the migration from 2.0.2/ to 1.12.8/ and switch from a non-unique covering index to restoring the original unique constraint dropped in 1.9.9. The two-phase `CREATE UNIQUE INDEX CONCURRENTLY` + `ADD CONSTRAINT USING INDEX` pattern avoids the ACCESS EXCLUSIVE lock on the hot `profiler_data_time_series` table during the upgrade. Closes the 1.9.9 regression and brings Postgres back in line with MySQL (which never lost the constraint). The leading `(entityFQNHash, extension)` prefix serves the column-profile batch query — the same shape MySQL has been running without 504s. MySQL needs no migration.

On the Java side, this eliminates two more N+1 patterns that compound the latency at customer scale:

* `getTableColumnsInternal` extension block: replaced the per-column `getColumnExtension()` loop with a single `getExtensionsByJsonSchema()` call, grouped by column FQN-hash in Java.
* `searchTableColumnsInternal` customMetrics block: applied the same batch-fetch pattern already used in `getTableColumnsInternal`, replacing per-column `getCustomMetrics()` with one `getExtensions()` call.

New DAO method on `EntityExtensionDAO`: `getExtensionsByJsonSchema(id, jsonSchema)` — selects extensions for a table id filtered by the jsonschema discriminator. Required because column extensions are stored with MD5-hashed extension keys and have no shared prefix the existing `getExtensions(id, prefix)` could use.
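A sketch of what the two-phase Postgres migration could look like, assuming the index/constraint name and column list from the 1.9.9 constraint described above (the actual file contents may differ):

```sql
-- Phase 1: build the unique index online. CONCURRENTLY avoids the
-- ACCESS EXCLUSIVE lock but cannot run inside a transaction block.
CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS
    profiler_data_time_series_unique_hash_extension_ts
    ON profiler_data_time_series (entityFQNHash, extension, operation, timestamp);

-- Phase 2: promote the finished index to a table constraint.
-- USING INDEX is a catalog-only flip; the table is not re-scanned.
ALTER TABLE profiler_data_time_series
    ADD CONSTRAINT profiler_data_time_series_unique_hash_extension_ts
    UNIQUE USING INDEX profiler_data_time_series_unique_hash_extension_ts;

-- Refresh statistics so the planner starts using the index immediately.
ANALYZE profiler_data_time_series;
```

The split matters because `ADD CONSTRAINT … UNIQUE` without a pre-built index would take an exclusive lock for the duration of a full-table scan, which is exactly what the PR is trying to avoid on a 6.9M-row table.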
…pushing IN list to the join

The `getLatestExtensionsBatch` query was the right shape for correctness, but the planner — on Postgres at customer scale, with the new unique constraint in place — was still choosing a parallel sequential scan over the full `profiler_data_time_series` table for the outer side of the JOIN, rather than a merge join with index scans on both sides.

Inner subquery: filtered by `entityFQNHash IN (...)`, used the index. Outer: only filtered by `p.extension = :extension`, no IN list; the planner couldn't infer the transitive constraint that `p.entityFQNHash` must equal one of the inner hashes (because it's enforced through the JOIN ON clause, not a WHERE predicate). Result: a full table scan reading 6.7M+ rows even when the actual answer is 23 rows.

Adding the redundant `AND p.entityFQNHash IN (<entityFQNHashes>)` to the outer WHERE makes the constraint explicit. The result set is unchanged (implied by the join condition), but the planner can now use the unique index for the outer access too.

Verified on the AUT dump (6.94M-row pdts): EXPLAIN of the batch query went from 7,234 ms to 79 ms (Hash Join + Parallel Seq Scan → Merge Join + Index Only Scan). Live API `/columns?fields=profile&include=all`: 6-36 seconds → 22-28 ms (warm) / 1.9 s (very first call). A 250-1000x improvement, depending on cache state. The same SQL works on both engines; no `@ConnectionAwareSqlQuery` split needed.
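The reshaped query might look like the following sketch (output columns and bind-parameter names are assumptions based on the diff in this PR; the substantive change is only the final predicate):

```sql
SELECT p.entityFQNHash, p.json
FROM profiler_data_time_series p
JOIN (
    -- Inner side: the IN list is present, so the planner already uses
    -- the (entityFQNHash, extension, ...) index here.
    SELECT entityFQNHash, MAX(timestamp) AS latestTs
    FROM profiler_data_time_series
    WHERE entityFQNHash IN (<entityFQNHashes>)
      AND extension = :extension
    GROUP BY entityFQNHash
) latest
  ON p.entityFQNHash = latest.entityFQNHash
 AND p.timestamp = latest.latestTs
WHERE p.extension = :extension
  -- Logically redundant (implied by the JOIN ON), but it makes the
  -- restriction visible to the planner for the outer access path.
  AND p.entityFQNHash IN (<entityFQNHashes>);
```

This is a standard trick for planners that do not propagate equivalence classes through join conditions: duplicating an implied predicate on the other side of the join costs nothing semantically but unlocks an index access path.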
… varchar(256)

The IT fixture for `test_getColumnsWithProfileField_correctnessAndNoBatchRegression` was building a tagFQN of `<classification>.<tag>` where each part went through `TestNamespace.prefix()`. With the descriptive method name (62 chars) + class name (15 chars) + namespace UUID (32 chars) plus the `profile_test_cls` / `profile_test_tag` base names (16 chars each), the resulting tagFQN was 263 characters — over the `tag_usage.tagFQN` VARCHAR(256) limit:

```
ERROR: value too long for type character varying(256)
```

Shorten the fixture base names from `profile_test_cls`/`profile_test_tag` to `cls`/`tag`. The namespace prefix already encodes test isolation (class + method + UUID), so the base name doesn't need to repeat that context. New tagFQN length: 237 chars (`cls__<32>__TableResourceIT__<62>.tag__<32>__TableResourceIT__<62>`), comfortably under 256.
Force-pushed d9f7003 to 458bc91.
Code Review — ✅ Approved, 3 resolved / 3 findings

Restores the missing Postgres unique constraint to fix query performance and optimizes N+1 database patterns in table column retrieval. All identified issues have been resolved.

✅ 3 resolved
✅ Bug: MySQL migration uses unsupported
```diff
   + "ON p.entityFQNHash = latest.entityFQNHash AND p.timestamp = latest.latestTs "
-  + "WHERE p.extension = :extension")
+  + "WHERE p.extension = :extension "
+  + "AND p.entityFQNHash IN (<entityFQNHashes>)")
```
Do you know what benefits we get in terms of performance by adding `entityFQNHashes` here, since we already have this in the inner join table predicate?



Fixes collate#3488
Related to / extends #26855
Root Cause

`/api/v1/tables/name/{fqn}/columns?fields=profile` was timing out at 6 minutes (504) on Postgres at scale. Three compounding regressions, all rooted in 1.9.9:

1. Postgres: unique constraint dropped, never recreated

`bootstrap/sql/migrations/native/1.9.9/postgres/schemaChanges.sql` dropped `profiler_data_time_series_unique_hash_extension_ts (entityFQNHash, extension, operation, timestamp)` to alter the generated `operation` column expression, and never recreated it. Only `(extension, timestamp)` remained — useless for queries filtering by `entityFQNHash`. MySQL was unaffected: `MODIFY COLUMN` re-evaluates the generated expression in place without touching the unique constraint.

2. Batch query has a planner pathology that makes the index unusable for the outer access

Even with the unique constraint restored, the original `getLatestExtensionsBatch` query still misbehaved: the inner subquery uses the index correctly, but the outer side is filtered only by `p.extension = :ext` — the planner can't see that `p.entityFQNHash` must transitively equal one of the inner hashes (the constraint is implicit in the JOIN ON, not a WHERE predicate). On Postgres this means a parallel sequential scan over the full table for the outer access, even with the index in place.

3. N+1 patterns inside the columns endpoint

`getTableColumnsInternal` and `searchTableColumnsInternal` had three independent N+1s:

- `customMetrics`: `getCustomMetrics(table, column.getName())` called once per paginated column.
- `extension`: `getColumnExtension(table.getId(), column.getFullyQualifiedName())` called once per paginated column.
- `tags` + `profile` together: `populateEntityFieldTags` called twice (once in each branch).

Fix
Postgres migration: restore the unique constraint

`bootstrap/sql/migrations/native/1.12.8/postgres/schemaChanges.sql`: restoring the original 1.1.5 constraint (rather than adding a parallel narrower index) gives correctness, engine parity with MySQL, and a single index doing double duty for the query. The two-phase build avoids the `ACCESS EXCLUSIVE` lock.

MySQL: no migration needed

The unique constraint has been continuously present since 1.1.5; 1.9.9's `MODIFY COLUMN` didn't touch it.

SQL: push the IN list into the outer WHERE

`CollectionDAO.ProfilerDataTimeSeriesDAO.getLatestExtensionsBatch`: the added clause is logically redundant — implied by the JOIN ON, which forces `p.entityFQNHash = latest.entityFQNHash`, and the inner WHERE, which restricts `latest.entityFQNHash` to the IN list. By transitivity, `p.entityFQNHash` is already restricted to the same list. The planner can't derive that across the JOIN boundary, so we spell it out — which lets the optimizer use the unique index for the outer access too. Same SQL on both engines; no `@ConnectionAwareSqlQuery` split.

Java: batch the remaining N+1s
- `TableRepository.getTableColumnsInternal` — `customMetrics` block: single `entityExtensionDAO.getExtensions(tableId, prefix)` call, group by column name in Java.
- `TableRepository.getTableColumnsInternal` — `extension` block: single `entityExtensionDAO.getExtensionsByJsonSchema(tableId, "columnExtension")` call, group by hashed column FQN.
- `TableRepository.getTableColumnsInternal` — `profile` block: skip the duplicate `populateEntityFieldTags` when `tags` is already requested.
- `TableRepository.searchTableColumnsInternal` — `customMetrics` block: same batch pattern via the new `CollectionDAO.EntityExtensionDAO` `getExtensionsByJsonSchema(id, jsonSchema)` query.

Test fixture: shorten classification/tag base names

`TestNamespace.prefix()` adds the namespace UUID + class name + 62-char method name. With base names `profile_test_cls`/`profile_test_tag`, the resulting tagFQN was 263 chars — overflowing `tag_usage.tagFQN VARCHAR(256)`. Shortened to `cls`/`tag` (test isolation preserved by `TestNamespace.prefix()`); the new tagFQN is 237 chars.

Verified at scale (6.94M-row pdts, 5.7 GB)
EXPLAIN of the batch query (real-data hashes that have profile rows): 7,234 ms → 79 ms, plan changing from Hash Join + Parallel Seq Scan to Merge Join + Index Only Scan.

Live API timings (5 runs each) — endpoints covered (the detailed timing table was not preserved in this export):

- `?fields=profile&include=all`
- `?fields=profile`
- `?fields=tags,customMetrics,extension,profile`
- `?fields=tags,profile`
- `?fields=customMetrics`
- `?fields=extension`
- `/tableProfile/latest?includeColumnProfile=true`
- `/tableProfile/latest?includeColumnProfile=false`
- `/tableProfile?startTs=…` (1y)
- `/columnProfile?startTs=…` (1y)
- `/systemProfile?startTs=…` (1y)

Deeply-nested struct table (4-level nesting, 16 flattened columns)

Profile data correctly attributed at every nesting depth. `valuesCount` differs by depth (1001/1002/1003/1004) — proving the values came from the actual seeded rows for each specific nested column, not mis-attributed across columns. `/columns?fields=profile&include=all` on this table: 0.191 s (cold + JIT) → 0.027-0.039 s (warm).

Risk note: existing duplicates
`CREATE UNIQUE INDEX CONCURRENTLY` will fail if any rows in `profiler_data_time_series` violate uniqueness on the four columns. After 8 months unconstrained on Postgres this is theoretically possible but very unlikely:

- `timestamp` is producer-generated at millisecond resolution; concurrent writes would need literal millisecond overlap.
- The bulk migration pass was an `UPDATE`, not an `INSERT`, so reruns don't create duplicates.
- An `INSERT … SELECT` from a fully unconstrained predecessor table without any dedup safety net shipped successfully — duplicates don't accumulate naturally even on hotter tables.

If the build does fail, the error message includes the offending key tuple; a one-line targeted DELETE for those rows + retry handles it.
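A hedged sketch of that cleanup, assuming Postgres (`ctid`-based dedup is Postgres-specific; also note that a failed `CREATE INDEX CONCURRENTLY` leaves an INVALID index behind that must be dropped before retrying):

```sql
-- Inspect any duplicate key groups that would block the unique build.
SELECT entityFQNHash, extension, operation, timestamp, COUNT(*) AS copies
FROM profiler_data_time_series
GROUP BY entityFQNHash, extension, operation, timestamp
HAVING COUNT(*) > 1;

-- Keep one physical row per duplicate group, delete the rest, then
-- drop the INVALID index (if any) and retry the concurrent build.
DELETE FROM profiler_data_time_series p
USING profiler_data_time_series q
WHERE p.ctid < q.ctid
  AND p.entityFQNHash = q.entityFQNHash
  AND p.extension = q.extension
  AND p.operation = q.operation
  AND p.timestamp = q.timestamp;
```

In practice the targeted one-line DELETE mentioned above (keyed on the tuple from the error message) is even safer than a blanket dedup; this sketch just shows the general shape.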
Migration Note
- `CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS` is idempotent — safe on rerun and on fresh installs.
- `ADD CONSTRAINT … UNIQUE USING INDEX` does not re-scan; it promotes the existing index via a catalog flip.
- `ANALYZE profiler_data_time_series` refreshes stats so the planner picks the new index immediately.
- Follows the existing `tag_usage` `CREATE INDEX CONCURRENTLY` precedent.

Changes
- `bootstrap/sql/migrations/native/1.12.8/postgres/schemaChanges.sql` — restore the unique constraint (two-phase) + `ANALYZE`.
- `bootstrap/sql/migrations/native/2.0.2/{mysql,postgres}/schemaChanges.sql` — removed (superseded by the 1.12.8 migration).
- `CollectionDAO.java` — `ProfilerDataTimeSeriesDAO.getLatestExtensionsBatch`: add `AND p.entityFQNHash IN (<entityFQNHashes>)` so the planner uses the unique index for the outer access.
- `CollectionDAO.java` — `EntityExtensionDAO`: new `getExtensionsByJsonSchema(id, jsonSchema)` query.
- `TableRepository.java` — batch `customMetrics` and `extension` fetches in `getTableColumnsInternal`; skip duplicate `populateEntityFieldTags`; batch `customMetrics` in `searchTableColumnsInternal`. New constant `COLUMN_EXTENSION_JSON_SCHEMA`.
- `TableResourceIT.java` — new regression test (`test_getColumnsWithProfileField_correctnessAndNoBatchRegression`); fixture base names shortened to `cls`/`tag` to fit `VARCHAR(256)`.

Test Plan
- Regression test covers the `profile`, `tags,customMetrics,extension,profile`, and `tags,profile` field combinations.
- Verifies no duplicate tag population when both `tags` and `profile` are requested.
- `mvn compile` clean; `mvn spotless:apply` clean.
- Migration is rerun-safe via the `IF NOT EXISTS` + `CONCURRENTLY` guards.
- Integration tests pass (`mvn verify -pl openmetadata-integration-tests -Dit.test=TableResourceIT`).
- `profiler_data_time_series`: full API surface (11 endpoints) timed before/after; EXPLAIN confirms outer index usage post-fix.

Generated by Claude Code