
fix(profiler): N+1 / missing-index regression on /tables/.../columns?fields=profile (#3488) #27746

Open
sonika-shah wants to merge 6 commits into main from fix/profiler-n1-missing-index-3488

Conversation


sonika-shah (Collaborator) commented Apr 26, 2026

Fixes collate#3488
Related to / extends #26855

Root Cause

/api/v1/tables/name/{fqn}/columns?fields=profile was timing out at 6 minutes (504) on Postgres at scale. Three compounding regressions, all rooted in 1.9.9:

1. Postgres: unique constraint dropped, never recreated

bootstrap/sql/migrations/native/1.9.9/postgres/schemaChanges.sql dropped profiler_data_time_series_unique_hash_extension_ts (entityFQNHash, extension, operation, timestamp) to alter the generated operation column expression, and never recreated it. Only (extension, timestamp) remained — useless for queries filtering by entityFQNHash.

MySQL was unaffected: MODIFY COLUMN re-evaluates the generated expression in place without touching the unique constraint.

2. Batch query has a planner pathology that makes the index unusable for the outer access

Even with the unique constraint restored, the original getLatestExtensionsBatch query:

SELECT p.entityFQNHash, p.json
FROM profiler_data_time_series p
JOIN ( SELECT entityFQNHash, MAX(timestamp) AS latestTs
       FROM profiler_data_time_series
       WHERE entityFQNHash IN (...) AND extension = :ext
       GROUP BY entityFQNHash ) latest
  ON p.entityFQNHash = latest.entityFQNHash AND p.timestamp = latest.latestTs
WHERE p.extension = :ext;

The inner subquery uses the index correctly. The outer side is filtered only by p.extension = :ext — the planner can't see that p.entityFQNHash must transitively equal one of the inner hashes (the constraint is implicit in the JOIN ON, not a WHERE predicate). On Postgres this means a parallel sequential scan over the full table for the outer access, even with the index in place.

3. N+1 patterns inside the columns endpoint

getTableColumnsInternal and searchTableColumnsInternal had three independent N+1s:

  • customMetrics: getCustomMetrics(table, column.getName()) called once per paginated column.
  • extension: getColumnExtension(table.getId(), column.getFullyQualifiedName()) called once per paginated column.
  • tags + profile together: populateEntityFieldTags called twice (once in each branch).

Fix

Postgres migration: restore the unique constraint

bootstrap/sql/migrations/native/1.12.8/postgres/schemaChanges.sql:

CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS
    profiler_data_time_series_unique_hash_extension_ts
    ON profiler_data_time_series (entityFQNHash, extension, operation, timestamp);

ALTER TABLE profiler_data_time_series
    ADD CONSTRAINT profiler_data_time_series_unique_hash_extension_ts
    UNIQUE USING INDEX profiler_data_time_series_unique_hash_extension_ts;

ANALYZE profiler_data_time_series;

Restoring the original 1.1.5 constraint (rather than adding a parallel narrower index) gives correctness, engine parity with MySQL, and a single index doing double duty for the query. Two-phase build avoids the ACCESS EXCLUSIVE lock.

MySQL: no migration needed

The unique constraint has been continuously present since 1.1.5; 1.9.9's MODIFY COLUMN didn't touch it.

SQL: push the IN list into the outer WHERE

CollectionDAO.ProfilerDataTimeSeriesDAO.getLatestExtensionsBatch:

       + "ON p.entityFQNHash = latest.entityFQNHash AND p.timestamp = latest.latestTs "
-      + "WHERE p.extension = :extension"
+      + "WHERE p.extension = :extension "
+      + "AND p.entityFQNHash IN (<entityFQNHashes>)"

The added clause is logically redundant — implied by the JOIN ON which forces p.entityFQNHash = latest.entityFQNHash and the inner WHERE which restricts latest.entityFQNHash to the IN list. By transitivity, p.entityFQNHash is already restricted to the same list. The planner can't derive that across the JOIN boundary, so we spell it out — which lets the optimizer use the unique index for outer access too.

Same SQL on both engines; no @ConnectionAwareSqlQuery split.

Java: batch the remaining N+1s

| Location | Fix |
| --- | --- |
| TableRepository.getTableColumnsInternal, customMetrics block | One entityExtensionDAO.getExtensions(tableId, prefix) call; group by column name in Java. |
| TableRepository.getTableColumnsInternal, extension block | New DAO method entityExtensionDAO.getExtensionsByJsonSchema(tableId, "columnExtension"); group by hashed column FQN. |
| TableRepository.getTableColumnsInternal, profile block | Skip duplicate populateEntityFieldTags when tags already requested. |
| TableRepository.searchTableColumnsInternal, customMetrics block | Same batch pattern as above. |
| CollectionDAO.EntityExtensionDAO | Add getExtensionsByJsonSchema(id, jsonSchema) query. |
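The batch-then-group shape used in the customMetrics and extension fixes can be sketched as follows. This is a standalone illustration, not the repository code: ExtensionRow and the literal rows are hypothetical stand-ins for the JSON rows the DAO returns.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Standalone sketch of the batch pattern: one fetch for the whole table,
// grouped per column in Java, instead of one DAO call per paginated column.
public class BatchCustomMetrics {
    // Stand-in for a row returned by entityExtensionDAO.getExtensions(tableId, prefix).
    record ExtensionRow(String columnName, String json) {}

    static Map<String, List<String>> groupByColumn(List<ExtensionRow> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                ExtensionRow::columnName,
                Collectors.mapping(ExtensionRow::json, Collectors.toList())));
    }

    public static void main(String[] args) {
        Map<String, List<String>> byColumn = groupByColumn(List.of(
                new ExtensionRow("id", "{\"metric\":\"m1\"}"),
                new ExtensionRow("id", "{\"metric\":\"m2\"}"),
                new ExtensionRow("name", "{\"metric\":\"m3\"}")));
        // Columns with no metrics get a shared immutable empty list
        // (per the review note, avoiding a fresh ArrayList per column).
        System.out.println(byColumn.getOrDefault("missing", List.of()).size()); // 0
        System.out.println(byColumn.get("id").size());                          // 2
    }
}
```

The N queries collapse to one; the per-column work becomes an in-memory map lookup.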

Test fixture: shorten classification/tag base names

TestNamespace.prefix() adds the namespace UUID + class name + 62-char method name. With base names profile_test_cls / profile_test_tag, the resulting tagFQN was 263 chars — overflowing tag_usage.tagFQN VARCHAR(256). Shortened to cls / tag (test isolation preserved by TestNamespace.prefix()); new tagFQN is 237 chars.
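The length arithmetic can be checked mechanically. This sketch assumes the prefix layout base + "__" + 32-char UUID + "__" + class name + "__" + 62-char method name, inferred from the tagFQN shape described in this PR; the class and method names are hypothetical:

```java
// Sketch checking the tagFQN length arithmetic; the prefix layout
// <base>__<32-char UUID>__TableResourceIT__<62-char method> is an
// assumption taken from the tagFQN shape quoted in this PR.
public class TagFqnLengthCheck {
    static int partLength(int baseNameLen) {
        // base + "__" + 32-char UUID + "__" + "TableResourceIT" + "__" + 62-char method name
        return baseNameLen + 2 + 32 + 2 + "TableResourceIT".length() + 2 + 62;
    }

    static int tagFqnLength(int clsBaseLen, int tagBaseLen) {
        // tagFQN = <classification part> "." <tag part>
        return partLength(clsBaseLen) + 1 + partLength(tagBaseLen);
    }

    public static void main(String[] args) {
        System.out.println(tagFqnLength(16, 16)); // profile_test_cls/_tag: 263, over VARCHAR(256)
        System.out.println(tagFqnLength(3, 3));   // cls/tag: 237, fits
    }
}
```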

Verified at scale (6.94M-row pdts, 5.7 GB)

EXPLAIN of the batch query (real-data hashes that have profile rows):

| Variant | Execution time | Plan |
| --- | --- | --- |
| Before SQL fix | 7,234 ms | Hash Join + Parallel Seq Scan over 6.7M rows |
| After SQL fix | 79 ms | Merge Join + Index Only Scan + Index Scan |

Live API timings (5 runs each):

| Endpoint | Before | After |
| --- | --- | --- |
| ?fields=profile&include=all | 6–36 s | 22–28 ms (warm) / 1.9 s (first cold + JIT) |
| ?fields=profile | 6–19 s | 17–24 ms |
| ?fields=tags,customMetrics,extension,profile | 30–150 ms | 22–85 ms |
| ?fields=tags,profile | 30–55 ms | 18–50 ms |
| ?fields=customMetrics | 11–18 ms | 11–19 ms |
| ?fields=extension | 10–13 ms | 11–22 ms |
| /tableProfile/latest?includeColumnProfile=true | 25–222 ms | 20–181 ms |
| /tableProfile/latest?includeColumnProfile=false | 9–15 ms | 13–24 ms |
| /tableProfile?startTs=… (1y) | 8–55 ms | 7–17 ms |
| /columnProfile?startTs=… (1y) | 18–21 ms | 19–34 ms |
| /systemProfile?startTs=… (1y) | 4–7 ms | 4–9 ms |

Deeply-nested struct table (4-level nesting, 16 flattened columns)

Profile data correctly attributed at every nesting depth. valuesCount differs by depth (1001/1002/1003/1004) — proves values came from the actual seeded rows for each specific nested column, not mis-attributed across columns:

[D1] address              profile: min=0 max=1000 valuesCount=1001.0
  [D2] coordinates        profile: min=0 max=1000 valuesCount=1002.0
    [D3] elevation        profile: min=0 max=1000 valuesCount=1003.0
      [D4] level          profile: min=0 max=1000 valuesCount=1004.0
      [D4] unit           profile: min=0 max=1000 valuesCount=1004.0
…

/columns?fields=profile&include=all on this table: 0.191s (cold + JIT) → 0.027–0.039s (warm).

Risk note: existing duplicates

CREATE UNIQUE INDEX CONCURRENTLY will fail if any rows in profiler_data_time_series violate uniqueness on the four columns. After 8 months unconstrained on Postgres this is theoretically possible but very unlikely:

  • timestamp is producer-generated at millisecond resolution; concurrent writes need literal millisecond overlap.
  • The 1.9.9 data migration used UPDATE, not INSERT, so reruns don't create duplicates.
  • 1.1.5 originally introduced this same constraint via INSERT … SELECT from a fully unconstrained predecessor table without any dedup safety net, and shipped successfully — duplicates don't accumulate naturally even on hotter tables.

If the build does fail, the error message includes the offending key tuple; a one-line targeted DELETE for those rows + retry handles it.
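One possible shape for that targeted cleanup, sketched here rather than taken from the PR (PostgreSQL-specific; keeps an arbitrary survivor per duplicate key):

```sql
-- Hedged sketch, not part of the migration: delete all but one row per
-- duplicate (entityFQNHash, extension, operation, timestamp) key, then
-- retry CREATE UNIQUE INDEX CONCURRENTLY.
DELETE FROM profiler_data_time_series a
USING profiler_data_time_series b
WHERE a.ctid < b.ctid
  AND a.entityFQNHash = b.entityFQNHash
  AND a.extension = b.extension
  AND a.operation IS NOT DISTINCT FROM b.operation
  AND a.timestamp = b.timestamp;
```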

Migration Note

  • CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS is idempotent — safe on rerun and fresh installs.
  • ADD CONSTRAINT … UNIQUE USING INDEX does not re-scan; it promotes the existing index via catalog flip.
  • ANALYZE profiler_data_time_series refreshes stats so the planner picks the new index immediately.
  • Same Flyway path as 1.11.0's tag_usage CREATE INDEX CONCURRENTLY precedent.

Changes

| File | Change |
| --- | --- |
| bootstrap/sql/migrations/native/1.12.8/postgres/schemaChanges.sql | New; restores the unique constraint via the two-phase pattern + ANALYZE. |
| bootstrap/sql/migrations/native/2.0.2/{mysql,postgres}/schemaChanges.sql | Deleted; superseded by 1.12.8. |
| CollectionDAO.java, ProfilerDataTimeSeriesDAO.getLatestExtensionsBatch | Added redundant AND p.entityFQNHash IN (<entityFQNHashes>) so the planner uses the unique index for outer access. |
| CollectionDAO.java, EntityExtensionDAO | New getExtensionsByJsonSchema(id, jsonSchema) query. |
| TableRepository.java | Batches customMetrics and extension fetches in getTableColumnsInternal; skips duplicate populateEntityFieldTags; batches customMetrics in searchTableColumnsInternal. New constant COLUMN_EXTENSION_JSON_SCHEMA. |
| TableResourceIT.java | Regression test (test_getColumnsWithProfileField_correctnessAndNoBatchRegression); fixture base names shortened to cls/tag to fit VARCHAR(256). |

Test Plan

  • Regression test covers the `profile`, `tags,customMetrics,extension,profile`, and `tags,profile` field combinations
  • Verifies column profiles returned correctly (min/max values, null for unprofiled columns)
  • Verifies tags populated when both tags and profile requested
  • mvn compile and mvn spotless:apply both run clean
  • Migration uses IF NOT EXISTS + CONCURRENTLY guards
  • TableResourceIT full suite (293 tests, 9 skipped) — all pass (mvn verify -pl openmetadata-integration-tests -Dit.test=TableResourceIT)
  • End-to-end on a 6.94M-row profiler_data_time_series: full API surface (11 endpoints) timed before/after, EXPLAIN confirms outer index usage post-fix
  • Deeply-nested struct table validation: 4-level nested table, 16 flattened columns, all receive correctly attributed profile values

Generated by Claude Code

…fields=profile (#3488)

Root cause
----------
The 1.9.9 migration introduced two separate index regressions on
`profiler_data_time_series`:

1. **PostgreSQL**: `schemaChanges.sql` explicitly dropped the unique
   constraint `profiler_data_time_series_unique_hash_extension_ts`
   (entityFQNHash, extension, operation, timestamp) to allow altering the
   generated `operation` column expression, but never recreated it.  After
   the migration the table kept only the `(extension, timestamp)` index,
   which is useless for queries filtering by `entityFQNHash`.

2. **MySQL/both**: `postDataMigrationSQLScript.sql` created temporary indexes
   (idx_pdts_entityFQNHash, idx_pdts_composite, etc.) for its bulk UPDATE
   pass and then dropped **all** of them, including the only index covering
   `entityFQNHash`.

The batch query issued by `getLatestExtensionsBatch()` when
`fields=profile` is requested:

  SELECT entityFQNHash, MAX(timestamp) FROM profiler_data_time_series
  WHERE entityFQNHash IN (...N hashes...) AND extension = 'table.columnProfile'
  GROUP BY entityFQNHash

required an `(entityFQNHash, extension, timestamp)` index.  Without it the
database performs a full table scan.  On production deployments with
millions of profiler rows this caused 100+ second response times (Grafana:
106,770 ms; 99% in DB; 93 dbOps).  Without `profile` in the fields param
the same endpoint returned in ~150-220 ms.

A secondary N+1 bug existed independently of the index: `customMetrics`
in fields called `getCustomMetrics(table, column)` once per paginated
column, issuing up to N identical queries against `entity_extension` and
then filtering in Java.

Fix
---
* **migration 2.0.2** (MySQL + PostgreSQL): `CREATE INDEX IF NOT EXISTS
  idx_pdts_fqnhash_ext_ts ON profiler_data_time_series(entityFQNHash,
  extension, timestamp)`.  The `IF NOT EXISTS` guard makes the migration
  safe to re-run and handles both upgrade and fresh-install paths.

* **`getTableColumnsInternal`** — `customMetrics` block: fetch all column
  custom metrics for the table in one query, group by column name in Java,
  then distribute.  Reduces N queries to 1.

* **`getTableColumnsInternal`** — `profile` block: skip the duplicate
  `populateEntityFieldTags` call when `tags` was already fetched earlier in
  the same request, saving one prefix-scan on `tag_usage` per request.

Related: PR #26855 (fixed N+1 tag queries on the list-tables path but left
the profiler-index and customMetrics N+1 untouched on the columns sub-path).
Copilot AI review requested due to automatic review settings April 26, 2026 20:55
github-actions Bot added the backend and safe to test labels Apr 26, 2026

Copilot AI left a comment


Pull request overview

Addresses a performance regression on the /tables/.../columns?fields=profile path by restoring an efficient composite index for profiler lookups and removing N+1 behavior when loading column customMetrics.

Changes:

  • Adds a composite index on profiler_data_time_series(entityFQNHash, extension, timestamp) for PostgreSQL and MySQL migrations.
  • Updates TableRepository#getTableColumnsInternal to fetch all column custom-metrics in one query and avoid duplicate tag population when tags is already requested.
  • Adds an integration regression test covering fields=profile and combinations with tags/customMetrics/extension.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TableRepository.java | Batches column custom-metrics retrieval and avoids duplicate tag population when profile is requested alongside tags. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/TableResourceIT.java | Adds regression coverage for the columns API with fields=profile and related combinations. |
| bootstrap/sql/migrations/native/2.0.2/postgres/schemaChanges.sql | Adds composite index to prevent profiler full-table scans on PostgreSQL. |
| bootstrap/sql/migrations/native/2.0.2/mysql/schemaChanges.sql | Adds composite index intended to prevent profiler full-table scans on MySQL. |


github-actions Bot commented Apr 26, 2026

🟡 Playwright Results — all passed (13 flaky)

✅ 3999 passed · ❌ 0 failed · 🟡 13 flaky · ⏭️ 86 skipped

| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🟡 Shard 1 | 298 | 0 | 1 | 4 |
| 🟡 Shard 2 | 747 | 0 | 5 | 8 |
| 🟡 Shard 3 | 757 | 0 | 4 | 7 |
| 🟡 Shard 4 | 774 | 0 | 1 | 18 |
| ✅ Shard 5 | 687 | 0 | 0 | 41 |
| 🟡 Shard 6 | 736 | 0 | 2 | 8 |
🟡 13 flaky test(s) (passed on retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/AdvancedSearch.spec.ts › Verify All conditions for Service Type field (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Database Schema (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Keyboard Delete selection (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Tags term should be consistent for search (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/AppStopRunModal.spec.ts › should close stop modal when cancel is clicked (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for MlModel (shard 4, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)


How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

… + batch column extension/customMetrics fetch

Move the migration from 2.0.2/ to 1.12.8/ and switch from a non-unique
covering index to restoring the original unique constraint dropped in
1.9.9. The two-phase CREATE UNIQUE INDEX CONCURRENTLY + ADD CONSTRAINT
USING INDEX pattern avoids the ACCESS EXCLUSIVE lock on the hot
profiler_data_time_series table during the upgrade. Closes the 1.9.9
regression and brings Postgres back in line with MySQL (which never lost
the constraint). The leading (entityFQNHash, extension) prefix serves
the column-profile batch query — same shape MySQL has been running
without 504s. MySQL needs no migration.

Java side, eliminates two more N+1 patterns that compound the latency at
customer scale:

* getTableColumnsInternal extension block: replaced per-column
  getColumnExtension() loop with a single getExtensionsByJsonSchema()
  call, grouped by column FQN-hash in Java.
* searchTableColumnsInternal customMetrics block: applied the same
  batch-fetch pattern already used in getTableColumnsInternal, replacing
  per-column getCustomMetrics() with one getExtensions() call.

New DAO method on EntityExtensionDAO:
  getExtensionsByJsonSchema(id, jsonSchema) — selects extensions for a
  table id filtered by the jsonschema discriminator. Required because
  column extensions are stored with MD5-hashed extension keys and have
  no shared prefix the existing getExtensions(id, prefix) could use.
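Why a prefix scan cannot work here follows from the key shape: MD5-hashed extension keys for sibling columns are unrelated 32-char hex strings. A small standalone illustration (the column FQN literals are made up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustration (hypothetical FQNs): MD5-hashed extension keys carry no
// common prefix for sibling columns, so getExtensions(id, prefix) cannot
// batch them -- hence the jsonschema-based filter.
public class HashedKeyDemo {
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        // Two sibling columns hash to unrelated 32-char keys.
        System.out.println(md5Hex("svc.db.schema.table.colA"));
        System.out.println(md5Hex("svc.db.schema.table.colB"));
    }
}
```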
Copilot AI review requested due to automatic review settings May 7, 2026 09:08

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

…pushing IN list to the join

The getLatestExtensionsBatch query was the right shape for correctness but
the planner — on Postgres at customer scale, with the new unique constraint
in place — was still choosing a parallel sequential scan over the full
profiler_data_time_series table for the outer side of the JOIN, rather than
a merge join with index scan on both sides.

Inner subquery: filtered by `entityFQNHash IN (...)`, used the index.
Outer: only filtered by `p.extension = :extension`, no IN list, planner
couldn't infer the transitive constraint that p.entityFQNHash must equal
one of the inner hashes (because it's enforced through the JOIN ON clause,
not a WHERE predicate). Result: full table scan reading 6.7M+ rows even
when the actual answer is 23 rows.

Adding the redundant `AND p.entityFQNHash IN (<entityFQNHashes>)` to the
outer WHERE makes the constraint explicit. The result set is unchanged
(implied by the join condition), but the planner can now use the unique
index for the outer access too.

Verified on the AUT dump (6.94M-row pdts):
  EXPLAIN of the batch query: 7,234ms → 79ms (Hash Join + Parallel Seq
  Scan → Merge Join + Index Only Scan).
  Live API /columns?fields=profile&include=all: 6-36 seconds → 22-28ms
  (warm) / 1.9s (very first call). 250-1000x improvement, depending on
  cache state.

Same SQL works on both engines; no @ConnectionAwareSqlQuery split needed.
Copilot AI review requested due to automatic review settings May 7, 2026 18:33

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

… varchar(256)

The IT fixture for test_getColumnsWithProfileField_correctnessAndNoBatchRegression
was building a tagFQN of `<classification>.<tag>` where each part went through
TestNamespace.prefix(). With the descriptive method name (62 chars) + class
name (15 chars) + namespace UUID (32 chars) plus the `profile_test_cls` /
`profile_test_tag` base names (16 chars each), the resulting tagFQN was 263
characters — over the tag_usage.tagFQN VARCHAR(256) limit:

  ERROR: value too long for type character varying(256)

Shorten the fixture base names from `profile_test_cls`/`profile_test_tag` to
`cls`/`tag`. The namespace prefix already encodes test isolation (class +
method + UUID), so the base name doesn't need to repeat that context.

New tagFQN length: 237 chars (cls__<32>__TableResourceIT__<62>.tag__<32>__TableResourceIT__<62>),
comfortably under 256.

gitar-bot Bot commented May 7, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Restores the missing Postgres unique constraint to fix query performance and optimizes N+1 database patterns in table column retrieval. All identified issues have been resolved.

✅ 3 resolved
Bug: MySQL migration uses unsupported IF NOT EXISTS for CREATE INDEX

📄 bootstrap/sql/migrations/native/2.0.2/mysql/schemaChanges.sql:18-19
The MySQL migration uses CREATE INDEX IF NOT EXISTS, but this syntax is not supported in MySQL prior to 8.0.29. No other MySQL migration in the project uses this pattern — all 148 other CREATE INDEX statements in mysql/*.sql files use plain CREATE INDEX without IF NOT EXISTS. The project's docker-compose pins mysql:8.0, which today resolves to ≥8.0.29, but users running older 8.0.x versions (e.g., 8.0.28 or below) will get a syntax error and the migration will fail, blocking the upgrade.

PostgreSQL correctly supports CREATE INDEX IF NOT EXISTS in all supported versions, so the Postgres file is fine.

Quality: Unnecessary ArrayList allocation per column without metrics

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TableRepository.java:2908
getOrDefault(column.getName(), new ArrayList<>()) allocates a new ArrayList for every column that has no custom metrics. Since the returned list is only set on the column and never mutated afterwards, an immutable empty list would avoid unnecessary allocations.

Bug: ALTER TABLE ADD CONSTRAINT not idempotent — fails on re-run

📄 bootstrap/sql/migrations/native/1.12.8/postgres/schemaChanges.sql:10-12
The migration at 1.12.8/postgres/schemaChanges.sql uses ALTER TABLE ... ADD CONSTRAINT ... UNIQUE USING INDEX without an idempotency guard. While CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS is safe to re-run, the subsequent ALTER TABLE ADD CONSTRAINT will raise an error if the constraint already exists (e.g., if the migration is replayed or the system already has the constraint from a different path). PostgreSQL does not support ADD CONSTRAINT IF NOT EXISTS.

The project's established pattern (seen in 1.13.0/postgres/schemaChanges.sql:27-49) wraps constraint creation in a DO $$ ... $$ block that checks pg_constraint before executing. The migration should follow the same pattern to be truly idempotent.
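A sketch of that guard, following the pg_constraint check the finding describes (constraint and index names taken from this PR's migration; the exact 1.13.0 wording may differ):

```sql
-- Idempotency-guard sketch: promote the index to a constraint only when the
-- constraint does not already exist (PostgreSQL has no
-- ADD CONSTRAINT IF NOT EXISTS).
DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1 FROM pg_constraint
        WHERE conname = 'profiler_data_time_series_unique_hash_extension_ts'
    ) THEN
        ALTER TABLE profiler_data_time_series
            ADD CONSTRAINT profiler_data_time_series_unique_hash_extension_ts
            UNIQUE USING INDEX profiler_data_time_series_unique_hash_extension_ts;
    END IF;
END $$;
```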


       + "ON p.entityFQNHash = latest.entityFQNHash AND p.timestamp = latest.latestTs "
-      + "WHERE p.extension = :extension")
+      + "WHERE p.extension = :extension "
+      + "AND p.entityFQNHash IN (<entityFQNHashes>)")

Do you know what benefits we get in terms of performance by adding entityFqnHashes here, since we already have this in the inner join table predicate?


sonarqubecloud Bot commented May 7, 2026
