fix: Move version metadata migration to 2.0.0 and remove crashing MySQL multi-valued index#27816
Conversation
…crashing MySQL multi-valued index
The Java checkstyle check failed. Please run the formatter and update the PR. You can install the project's pre-commit hooks to catch this locally before pushing.
Pull request overview
This PR fixes a MySQL migration crash caused by a multi-valued JSON index on entity_extension.changedFieldKeys by removing the problematic index, moving the schema change to the 2.0.0 migration, and replacing the previous single-shot SQL backfill with a batched Java backfill in the v200 migration utilities.
Changes:
- Remove the `1.12.7` MySQL/Postgres migration SQL files that added/backfilled version metadata (including the crashing MySQL multi-valued index).
- Add `versionNum`/`changedFieldKeys` columns and safe indexes to the `2.0.0` schema changes (keep Postgres GIN; remove MySQL multi-valued index).
- Implement and invoke a batched Java backfill (`backfillVersionMetadata`) during v200 data migrations; move/extend unit tests accordingly.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v200/MigrationUtil.java | Adds batched Java backfill logic plus helpers to compute versionNum and changedFieldKeys. |
| openmetadata-service/src/main/java/org/openmetadata/service/migration/mysql/v200/Migration.java | Invokes the new backfill as part of v200 MySQL data migration (non-blocking via try/catch). |
| openmetadata-service/src/main/java/org/openmetadata/service/migration/postgres/v200/Migration.java | Invokes the new backfill as part of v200 Postgres data migration (non-blocking via try/catch). |
| openmetadata-service/src/test/java/org/openmetadata/service/migration/utils/v200/MigrationUtilTest.java | Adds unit tests for extraction helpers + backfill behavior and SQL selection. |
| openmetadata-service/src/test/java/org/openmetadata/service/migration/utils/MigrationSqlStatementHashTest.java | Updates migration SQL hash assertions to validate the 2.0.0/mysql/schemaChanges.sql file. |
| bootstrap/sql/migrations/native/2.0.0/mysql/schemaChanges.sql | Adds versionNum/changedFieldKeys columns and (id, versionNum) index (no multi-valued JSON index). |
| bootstrap/sql/migrations/native/2.0.0/postgres/schemaChanges.sql | Adds versionNum/changedFieldKeys columns and indexes (incl. GIN on changedFieldKeys). |
| bootstrap/sql/migrations/native/1.12.7/mysql/schemaChanges.sql | Deleted (removes the crashing MySQL multi-valued index migration). |
| bootstrap/sql/migrations/native/1.12.7/mysql/postDataMigrationSQLScript.sql | Deleted (removes single-shot SQL backfill). |
| bootstrap/sql/migrations/native/1.12.7/postgres/schemaChanges.sql | Deleted (DDL moved to 2.0.0). |
| bootstrap/sql/migrations/native/1.12.7/postgres/postDataMigrationSQLScript.sql | Deleted (backfill moved to Java batching). |
Comments suppressed due to low confidence (1)
openmetadata-service/src/test/java/org/openmetadata/service/migration/utils/v200/MigrationUtilTest.java:54
- There are two identical imports for org.mockito.ArgumentCaptor in this file. Please remove the duplicate to keep imports clean and avoid potential formatter/checkstyle failures.
```java
import org.mockito.ArgumentCaptor;
import org.mockito.MockedStatic;
import org.openmetadata.service.resources.databases.DatasourceConfig;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.mockito.ArgumentCaptor;
import org.openmetadata.schema.entity.activity.ActivityEvent;
```
```java
void backfillIsNoOpWhenQueryReturnsEmptyBatch() {
  Handle bHandle = mock(Handle.class, RETURNS_DEEP_STUBS);
  when(bHandle.createQuery(any(String.class)).bind(anyString(), anyInt()).mapToMap().list())
      .thenReturn(List.of());
```
The backfillVersionMetadata tests are stubbing createQuery(...).bind(anyString(), anyInt())..., but the implementation binds two string params (lastId, lastExt) and then an int (limit). As written, these stubs won’t match the actual call chain and list() can return Mockito’s default null, causing the migration to NPE when it does batch.size(). Update the stubbing to match the real bind overloads/arguments (or stub at the mapToMap().list() level without binding matchers).
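One way to make the stubbing overload-agnostic is to mock the JDBI `Query`/`ResultIterable` chain explicitly rather than relying on deep stubs (a sketch, assuming the test's existing Mockito/JUnit static imports):

```java
// Sketch only: stub the JDBI chain explicitly so it matches whichever bind(...)
// overloads backfillVersionMetadata actually calls.
Handle bHandle = mock(Handle.class);
Query query = mock(Query.class);
@SuppressWarnings("unchecked")
ResultIterable<Map<String, Object>> rows = mock(ResultIterable.class);

when(bHandle.createQuery(anyString())).thenReturn(query);
when(query.bind(anyString(), anyString())).thenReturn(query); // lastId, lastExt
when(query.bind(anyString(), anyInt())).thenReturn(query);    // limit
when(query.mapToMap()).thenReturn(rows);
when(rows.list()).thenReturn(List.of()); // empty batch -> backfill should be a no-op

assertDoesNotThrow(() -> MigrationUtil.backfillVersionMetadata(bHandle));
```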
```java
assertDoesNotThrow(() -> MigrationUtil.backfillVersionMetadata(bHandle));
```
```java
ArgumentCaptor<String> sqlCaptor = ArgumentCaptor.forClass(String.class);
verify(bHandle).createUpdate(sqlCaptor.capture());
String sql = sqlCaptor.getValue();
assertTrue(sql.contains("changedFieldKeys = :changedFieldKeys"));
assertFalse(sql.contains("::jsonb"));
```
These new backfill tests assert/verify Handle.createUpdate(...) is used to run the update SQL, but the implementation uses handle.prepareBatch(updateSql) and executes a PreparedBatch. As a result the assertions around captured SQL (and the times(3) verification) won’t validate the actual behavior. Adjust the tests to mock/verify prepareBatch(...) and PreparedBatch.execute(), and only expect createUpdate(...) when explicitly testing the fallback path.
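A sketch of what that could look like, assuming Mockito's `RETURNS_SELF` answer and the JDBI 3 `PreparedBatch` API (static imports from Mockito and JUnit assumed):

```java
// Sketch only: the happy path should be asserted against prepareBatch(...), not createUpdate(...).
PreparedBatch preparedBatch = mock(PreparedBatch.class, RETURNS_SELF);
when(bHandle.prepareBatch(anyString())).thenReturn(preparedBatch);

MigrationUtil.backfillVersionMetadata(bHandle);

ArgumentCaptor<String> sqlCaptor = ArgumentCaptor.forClass(String.class);
verify(bHandle).prepareBatch(sqlCaptor.capture());
assertTrue(sqlCaptor.getValue().contains("changedFieldKeys = :changedFieldKeys"));
verify(preparedBatch, atLeastOnce()).execute();
// createUpdate(...) should only show up in the test that exercises the fallback path.
verify(bHandle, never()).createUpdate(anyString());
```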
| + "WHERE (id, extension) > (:lastId, :lastExt) " | ||
| + "ORDER BY id, extension " | ||
| + "LIMIT :limit") |
backfillVersionMetadata currently selects batches from entity_extension without filtering (versionNum IS NULL / extension LIKE '%.version.%'). That means rerunning the migration (or running it on Postgres instances that already populated these columns) will scan the entire table and rewrite every version row, which can generate large write amplification/WAL and extend upgrade time unnecessarily. Consider adding WHERE versionNum IS NULL AND extension LIKE '%.version.%' (and keeping the keyset pagination) so the backfill is idempotent and only touches rows that still need population.
| + "WHERE (id, extension) > (:lastId, :lastExt) " | |
| + "ORDER BY id, extension " | |
| + "LIMIT :limit") | |
| + "WHERE versionNum IS NULL " | |
| + "AND extension LIKE :extensionPattern " | |
| + "AND (id, extension) > (:lastId, :lastExt) " | |
| + "ORDER BY id, extension " | |
| + "LIMIT :limit") | |
| .bind("extensionPattern", "%.version.%") |
```java
totalProcessed += batch.size();
if (!batch.isEmpty()) {
  LOG.info("Backfilled {} entity_extension rows so far", totalProcessed);
}
```
The log counters in backfillVersionMetadata use totalProcessed += batch.size() and then log "Backfilled" rows, but the code may skip rows (non-version extensions, null id/extension) and only add batchCount updates. This makes the progress logs inaccurate and can be misleading during upgrades. Track/log the number of rows actually updated (e.g., batchCount or the batch execute result) separately from rows scanned.
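For example (a sketch; the counter names are illustrative, not the PR's actual fields):

```java
// Count scanned rows and actually-updated rows separately so the progress log is accurate.
totalScanned += batch.size();
totalUpdated += batchCount; // rows that were actually added to the PreparedBatch
if (batchCount > 0) {
  LOG.info("Backfill progress: scanned {} rows, updated {} so far", totalScanned, totalUpdated);
}
```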
```java
private static final int BATCH_SIZE = 5000;

private static final String UPDATE_MYSQL =
    "UPDATE entity_extension SET versionNum = :versionNum, changedFieldKeys = :changedFieldKeys "
        + "WHERE id = :id AND extension = :extension";
```
BATCH_SIZE is set to 5000, which is much larger than batching used in other migration utils in this repo (commonly 100–500). A 5000-row select + JSON parse + batch update can increase transaction size and lock/WAL pressure during upgrades. Consider reducing this to a smaller value (and/or making it configurable) to keep migrations predictable on large datasets.
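One possible shape for a configurable value (the property name below is hypothetical, not an existing OpenMetadata setting):

```java
// Sketch: default to a smaller batch in line with other migration utilities and allow an override.
private static final int BATCH_SIZE =
    Integer.getInteger("openmetadata.migration.backfillBatchSize", 500);
```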
Force-pushed from 298a69c to b49fa11.
…data

Replace WHERE versionNum IS NULL full-table scan (EXPLAIN: type=ALL, 2.1M rows/batch) with PK cursor pagination (EXPLAIN: type=index, key=PRIMARY, 5000 rows/batch). Use PreparedBatch to reduce UPDATE round trips from 5000 to 1 per batch. Bump BATCH_SIZE 1000 -> 5000.
Force-pushed from b49fa11 to 322436a.
Code Review ✅ Approved (7 resolved / 7 findings)

Migration moved to 2.0.0 and incompatible MySQL indexes removed. Multiple critical stability issues addressed, including NullPointerExceptions, batch fallback data loss, and inefficient table scans.

✅ 7 resolved
✅ Bug: NullPointerException if …
```sql
ALTER TABLE entity_extension
  ADD COLUMN versionNum DOUBLE NULL,
  ADD COLUMN changedFieldKeys JSON NULL;

CREATE INDEX idx_entity_extension_version_order
  ON entity_extension (id, versionNum);
```
```java
} catch (Exception versionOnlyEx) {
  LOG.warn(
      "Skipping row id={} extension={} after both updates failed",
      id,
      extension,
      versionOnlyEx);
}
```
🟡 Playwright Results — all passed (16 flaky)

✅ 3966 passed · ❌ 0 failed · 🟡 16 flaky · ⏭️ 86 skipped

🟡 16 flaky test(s) (passed on retry)

How to debug locally:

```
# Download the playwright-test-results-<shard> artifact and unzip it
npx playwright show-trace path/to/trace.zip  # view trace
```



Problem
The 1.12.7 migration (`e4d3e423e1`) introduced two new columns on `entity_extension` — `versionNum` and `changedFieldKeys` — to support version timeline filtering. It populated those columns via a `postDataMigrationSQLScript.sql` and backed `changedFieldKeys` with a MySQL multi-valued index.

This caused an immediate crash in AUT.
Root cause — why the MySQL multi-valued index blows up
MySQL's multi-valued index on a JSON array reserves the full declared `CHAR(N)` storage per element, per row in the index. The index was declared as `CHAR(512)` with `utf8mb4`, which reserves 2048 bytes per element. MySQL imposes a hard per-row limit of ~8 KB for multi-valued indexes. That means any entity version with more than ~4 changed field names in a single version would exceed the limit and crash the `ALTER TABLE` or `UPDATE`.

`entity_extension` in production routinely has versions with dozens of changed fields (e.g. a bulk tag update can produce 50–300 `changedFieldKeys` entries in one version row). The index design made the migration structurally impossible to run on real data.

The Postgres GIN index on the same column does not have this problem — GIN stores posting lists per element without a per-row byte cap — so it is kept unchanged.
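A quick back-of-the-envelope check of that limit (values taken from the figures above):

```java
// Why ~4 elements: each indexed element reserves the full declared CHAR width.
int bytesPerElement = 512 * 4;                          // CHAR(512) in utf8mb4
int maxMultiValuedIndexRowBytes = 8 * 1024;             // ~8 KB per-row cap described above
int maxElements = maxMultiValuedIndexRowBytes / bytesPerElement; // = 4 changed field keys
```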
Why not partial indexing (index only the first N elements)?
MySQL's multi-valued index indexes all elements in the array or none. There is no way to index a subset. The only options were:
- Shrink the declared `CHAR(N)` — still fails for wide entities; just moves the threshold
- Remove the index entirely, which is the option taken here

Why is the index safe to remove?
Every query that uses `changedFieldKeys` always starts with `WHERE e.id = :id`, scoping results to a single entity (typically 100–500 version rows). `JSON_CONTAINS` on 100–500 rows is negligible — the index provided no meaningful speedup. A legacy code path (`getExtensionsWithFieldChangedLegacy`) already works without the index and is already live in production.

What changed
1. Removed the 1.12.7 migration entirely
All four SQL files created by the original PR are deleted. They did not exist before that commit, so reverting them leaves 1.12.7 clean.
2. Moved the DDL to 2.0.0 schemaChanges.sql
MySQL — appended to `2.0.0/mysql/schemaChanges.sql`:

No multi-valued index. The BTree index on `(id, versionNum)` is kept for efficient DESC ordering of versions per entity.

Postgres — appended to `2.0.0/postgres/schemaChanges.sql`:

GIN index retained — it works correctly on Postgres with no per-row size limit.
3. Replaced the SQL data migration with a batched Java migration
The original `postDataMigrationSQLScript.sql` was a single-shot CTE that ran as one SQL statement over the entire `entity_extension` table. On large instances this would hold a lock for minutes and risk a timeout.

The data migration is now a batched Java method in `v200/MigrationUtil.backfillVersionMetadata()`.

Cursor-based pagination
The SELECT uses cursor-based pagination on the PRIMARY KEY instead of `WHERE versionNum IS NULL`. Why this matters on large tables: the naive `WHERE versionNum IS NULL LIMIT N` approach forces a full table scan on every batch. Verified with `EXPLAIN` on a 2.1M-row table:

| Batch predicate | EXPLAIN `type` | `key` |
|---|---|---|
| `WHERE versionNum IS NULL LIMIT 1000` | `ALL` | (none) |
| `(id, extension) > (x, y)` | `index` | `PRIMARY` |

With the cursor, the PRIMARY KEY index drives the scan forward from the last position — each batch reads exactly `LIMIT` rows. The `LIKE` and `IS NULL` filters are applied as cheap post-filters within the cursor's range. This also makes re-runs safe: `versionNum IS NULL` skips already-processed rows so the migration is idempotent.

PreparedBatch for updates
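Putting the cursor query and the batched update together, a minimal sketch of the loop (assuming the JDBI 3 `Handle` API; the SQL, parameter names, and the helpers `extractVersionNum` / `extractChangedFieldKeys` are illustrative, not the exact code in `MigrationUtil`):

```java
// Sketch only; 'handle' is the JDBI Handle passed into the migration.
String lastId = "";
String lastExt = "";
while (true) {
  List<Map<String, Object>> batch =
      handle
          .createQuery(
              "SELECT id, extension, json FROM entity_extension "
                  + "WHERE versionNum IS NULL AND extension LIKE :extensionPattern "
                  + "AND (id, extension) > (:lastId, :lastExt) "
                  + "ORDER BY id, extension LIMIT :limit")
          .bind("extensionPattern", "%.version.%")
          .bind("lastId", lastId)
          .bind("lastExt", lastExt)
          .bind("limit", BATCH_SIZE)
          .mapToMap()
          .list();
  if (batch.isEmpty()) {
    break;
  }
  PreparedBatch updates = handle.prepareBatch(UPDATE_MYSQL);
  for (Map<String, Object> row : batch) {
    String id = (String) row.get("id");
    String extension = (String) row.get("extension");
    updates
        .bind("versionNum", extractVersionNum(extension))       // hypothetical helper
        .bind("changedFieldKeys", extractChangedFieldKeys(row))  // hypothetical helper, JSON string
        .bind("id", id)
        .bind("extension", extension)
        .add();
  }
  updates.execute(); // one round trip per batch; on failure, fall back to per-row updates
  Map<String, Object> last = batch.get(batch.size() - 1);
  lastId = (String) last.get("id");
  lastExt = (String) last.get("extension");
}
```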
Instead of issuing one `execute()` per row (N round trips per batch), all updates in a batch are sent as a single `PreparedBatch.execute()` call — one round trip regardless of batch size.

Fallback on batch failure
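In outline, the fallback described below looks roughly like this (a sketch; `updateVersionAndKeys` and `updateVersionOnly` are illustrative names, not the actual methods):

```java
// Two-tier per-row fallback when the whole PreparedBatch fails.
try {
  updates.execute();
} catch (Exception batchEx) {
  for (Map<String, Object> row : batch) {
    try {
      updateVersionAndKeys(handle, row); // full update: versionNum + changedFieldKeys
    } catch (Exception rowEx) {
      updateVersionOnly(handle, row);    // versionNum only, so the row is not retried
    }
  }
}
```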
If `PreparedBatch.execute()` fails (e.g. one row has a malformed JSON column), the code falls back to processing each row individually:

- first a full update (`versionNum` + `changedFieldKeys`)
- if that fails, a `versionNum`-only update (guarantees the row won't be retried)

This preserves `changedFieldKeys` for all healthy rows in the batch even when one row is problematic. The old per-row approach had the same two-tier fallback; the new approach restores that per-row safety net while still getting the throughput benefit of `PreparedBatch` on the happy path.

Performance summary (2.1M-row table)
4. Java vs SQL — behavioral equivalence
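For example, the Java-side `versionNum` extraction is roughly the following (a sketch; the helper name is illustrative), mirroring the SQL expression in the first row of the table below:

```java
// Equivalent to CAST(SUBSTRING_INDEX(ext, '.version.', -1) AS DOUBLE) in the original SQL.
static Double extractVersionNum(String extension) {
  int idx = extension.lastIndexOf(".version.");
  if (idx < 0) {
    return null; // not a version row
  }
  return Double.parseDouble(extension.substring(idx + ".version.".length()));
}
```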
| Behavior | SQL migration | Java backfill | Notes |
|---|---|---|---|
| `versionNum` source | `CAST(SUBSTRING_INDEX(ext, '.version.', -1) AS DOUBLE)` | `lastIndexOf(".version.")` + `Double.parseDouble` | identical semantics |
| Key de-duplication | `SELECT DISTINCT` | `LinkedHashSet` | same |
| Empty-field filtering | `WHERE field_name IS NOT NULL AND field_name <> ''` | `nameNode.isNull()` + `isEmpty()` | same |
| `changedFieldKeys` ordering | `JSON_ARRAYAGG` (arbitrary) | sorted | no functional impact; consumers use `JSON_CONTAINS` |

5. Tests
- `v200/MigrationUtilTest` (41 tests total): covers the extraction helpers and the backfill behavior, including the `PreparedBatch` path (`add()` called per row), batch-level fallback triggering per-row retry, and per-row full-update failure falling back to `versionNum`-only.
- `MigrationSqlStatementHashTest` updated to validate `2.0.0/mysql/schemaChanges.sql` (unique statement hashes, no `INFORMATION_SCHEMA` references).

Pros / Cons
Pros
- `changedFieldKeys` is preserved for all healthy rows even if one row is bad

Cons
- … (`AND versionNum IS NULL`) so it is safe to re-run
- `changedFieldKeys` array order differs from the original SQL (`JSON_ARRAYAGG` is unordered; Java sorts alphabetically). This has no functional impact since all consumers use `JSON_CONTAINS`, which is order-agnostic.
- `ADD COLUMN` has no `IF NOT EXISTS` guard (MySQL 8.0 limitation). If a MySQL environment somehow survived 1.12.7 with the columns already present, the 2.0.0 DDL would fail with "Duplicate column name". In practice this cannot happen since 1.12.7 crashed before the columns were created on MySQL.
entity_extensionrows getversionNumandchangedFieldKeyspopulatedMigrationSqlStatementHashTestpasses in CIv200/MigrationUtilTest(41 tests) passes in CI