
Add db-tune ops subcommand + production RDS runbook #27890

Merged: harshach merged 8 commits into main from harshach/rds-perf-doc on May 11, 2026
Conversation

@harshach harshach commented May 4, 2026

Summary

  • New `./openmetadata-ops.sh db-tune` subcommand: read-only tuning report by default, with `--apply` to execute the recommendations and `--analyze` to refresh planner stats. Postgres gets autovacuum reloptions; MySQL gets InnoDB persistent-stats table options. The pure heuristic + formatter is unit-tested; the live-DB flow is covered by `DbTuneIT` against the IT bootstrap's Postgres / MySQL Testcontainer.

Sample input / output

Report mode (default — no DB writes)

```
$ ./openmetadata-ops.sh db-tune

Database engine: PostgreSQL 17.2

=== Server-level parameter compliance ===
+---------------------------------+---------+-------------------+-------------+-------------------------+
| Parameter                       | Current | Recommended       | Status      | Note                    |
+---------------------------------+---------+-------------------+-------------+-------------------------+
| shared_buffers                  | 16384   | 40% of RAM        | UNTUNED     | RAM-relative; verify    |
| effective_cache_size            | 524288  | 75% of RAM        | UNTUNED     | RAM-relative; verify    |
| work_mem                        | 4096    | 131072            | MISMATCH    |                         |
| maintenance_work_mem            | 65536   | 2097152           | MISMATCH    |                         |
| random_page_cost                | 4       | 1.1               | MISMATCH    |                         |
| effective_io_concurrency        | 1       | 200               | MISMATCH    |                         |
| max_parallel_workers_per_gather | 2       | 4                 | MISMATCH    |                         |
| autovacuum_naptime              | 60      | 15                | MISMATCH    |                         |
| autovacuum_vacuum_scale_factor  | 0.2     | 0.05              | MISMATCH    |                         |
| autovacuum_analyze_scale_factor | 0.1     | 0.02              | MISMATCH    |                         |
+---------------------------------+---------+-------------------+-------------+-------------------------+

These cannot be applied by this tool — change them in your DB parameter group / RDS console.

=== Per-table recommendations (26 tables) ===
+--------------------------+------------+--------+-----------+----------------------------------------+---------+----------------------------------+
| Table                    | Rows       | Size   | Current   | Recommended                            | Action  | Reason                           |
+--------------------------+------------+--------+-----------+----------------------------------------+---------+----------------------------------+
| entity_relationship      |  8,400,000 | 3.2 GB | (default) | autovacuum_analyze_scale_factor=0.005, | APPLY   | Join target, write-heavy         |
|                          |            |        |           | autovacuum_vacuum_cost_limit=4000,     |         |                                  |
|                          |            |        |           | autovacuum_vacuum_scale_factor=0.01    |         |                                  |
| tag_usage                |  7,402,343 | 30 GB  | autovac.. | autovacuum_analyze_scale_factor=0.005, | TIGHTEN | Hottest table on read path       |
|                          |            |        |           | autovacuum_vacuum_cost_delay=0,        |         |                                  |
|                          |            |        |           | autovacuum_vacuum_cost_limit=4000,     |         |                                  |
|                          |            |        |           | autovacuum_vacuum_scale_factor=0.01    |         |                                  |
| storage_container_entity |    580,123 | 1.5 GB | (default) | autovacuum_analyze_scale_factor=0.01,  | APPLY   | Large entity table; tighten      |
|                          |            |        |           | autovacuum_vacuum_scale_factor=0.02    |         | autovacuum so list count stats   |
|                          |            |        |           |                                        |         | stay fresh                       |
| table_entity             |    420,000 | 2.1 GB | (default) | autovacuum_analyze_scale_factor=0.01,  | APPLY   | Large entity table               |
|                          |            |        |           | autovacuum_vacuum_scale_factor=0.02    |         |                                  |
| change_event             | 12,400,000 | 4.0 GB | (default) | autovacuum_analyze_scale_factor=0.1,   | RELAX   | Append-only, relax autovacuum    |
|                          |            |        |           | autovacuum_vacuum_scale_factor=0.2     |         |                                  |
| dashboard_entity         |      8,500 | 32 MB  | (default) | (default)                              | SKIP    | Row count 8500 below threshold   |
|                          |            |        |           |                                        |         | 10000                            |
| api_collection_entity    |          0 | 0 KB   | (default) | (default)                              | SKIP    | Row count 0 below threshold      |
|                          |            |        |           |                                        |         | 10000                            |
+--------------------------+------------+--------+-----------+----------------------------------------+---------+----------------------------------+

Next steps:
./bootstrap/openmetadata-ops.sh db-tune --apply --analyze   # apply + refresh planner stats
./bootstrap/openmetadata-ops.sh db-tune --apply             # apply only; run analyze-tables later
```

Apply mode (interactive prompt)

```
$ ./openmetadata-ops.sh db-tune --apply

[ ... same report as above ... ]

About to apply 5 ALTER statements:
ALTER TABLE "entity_relationship" SET (autovacuum_analyze_scale_factor = 0.005, autovacuum_vacuum_cost_limit = 4000, autovacuum_vacuum_scale_factor = 0.01);
ALTER TABLE "tag_usage" SET (autovacuum_analyze_scale_factor = 0.005, autovacuum_vacuum_cost_delay = 0, autovacuum_vacuum_cost_limit = 4000, autovacuum_vacuum_scale_factor = 0.01);
ALTER TABLE "storage_container_entity" SET (autovacuum_analyze_scale_factor = 0.01, autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE "table_entity" SET (autovacuum_analyze_scale_factor = 0.01, autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE "change_event" SET (autovacuum_analyze_scale_factor = 0.1, autovacuum_vacuum_scale_factor = 0.2);

Apply now? [y/N]: y

+--------------------------+---------+--------+----------+
| Table                    | Action  | Status | Details  |
+--------------------------+---------+--------+----------+
| entity_relationship      | APPLY   | OK     | Applied  |
| tag_usage                | TIGHTEN | OK     | Applied  |
| storage_container_entity | APPLY   | OK     | Applied  |
| table_entity             | APPLY   | OK     | Applied  |
| change_event             | RELAX   | OK     | Applied  |
+--------------------------+---------+--------+----------+
```

Apply + refresh planner stats

```
$ ./openmetadata-ops.sh db-tune --apply --yes --analyze

[ ... report ... ]

+--------------------------+---------+--------+--------------------+
| Table                    | Action  | Status | Details            |
+--------------------------+---------+--------+--------------------+
| entity_relationship      | APPLY   | OK     | Applied + analyzed |
| tag_usage                | TIGHTEN | OK     | Applied + analyzed |
| storage_container_entity | APPLY   | OK     | Applied + analyzed |
| ...                      | ...     | ...    | ...                |
+--------------------------+---------+--------+--------------------+
```

MySQL output

```
$ ./openmetadata-ops.sh db-tune

Database engine: MySQL 8.0.36

=== Server-level parameter compliance ===
+--------------------------------------+----------+-----------------------+-------------+
| Parameter                            | Current  | Recommended           | Status      |
+--------------------------------------+----------+-----------------------+-------------+
| innodb_buffer_pool_size              | 1.0 GB   | 40-60% of RAM         | UNTUNED     |
| innodb_io_capacity                   | 200      | 2000                  | MISMATCH    |
| innodb_io_capacity_max               | 2000     | 4000                  | MISMATCH    |
| innodb_stats_persistent_sample_pages | 20       | 64                    | MISMATCH    |
| ...                                  | ...      | ...                   | ...         |
+--------------------------------------+----------+-----------------------+-------------+

=== Per-table recommendations (25 tables) ===
+--------------------------+-----------+--------+-----------+--------------------------+--------+----------------------------------+
| Table                    | Rows      | Size   | Current   | Recommended              | Action | Reason                           |
+--------------------------+-----------+--------+-----------+--------------------------+--------+----------------------------------+
| storage_container_entity |   580,123 | 1.5 GB | (default) | STATS_AUTO_RECALC=1,     | APPLY  | Large entity table; bump InnoDB  |
|                          |           |        |           | STATS_PERSISTENT=1,      |        | stats sampling                   |
|                          |           |        |           | STATS_SAMPLE_PAGES=64    |        |                                  |
| ...                      | ...       | ...    | ...       | ...                      | ...    | ...                              |
+--------------------------+-----------+--------+-----------+--------------------------+--------+----------------------------------+
```

The `--apply` SQL on MySQL uses `ALTER TABLE \`name\` STATS_PERSISTENT=1, STATS_AUTO_RECALC=1, STATS_SAMPLE_PAGES=64` (no parens, no equals-spaces — MySQL syntax).
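The two dialects can be sketched with a pair of tiny formatters. Class and method names here are illustrative, not the PR's actual `buildAlterStatement` implementation:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class AlterSql {
    // Postgres: reloptions go inside SET (...), with spaces around '='.
    static String postgresAlter(String table, Map<String, String> opts) {
        String body = new TreeMap<>(opts).entrySet().stream()
            .map(e -> e.getKey() + " = " + e.getValue())
            .collect(Collectors.joining(", "));
        return "ALTER TABLE \"" + table + "\" SET (" + body + ");";
    }

    // MySQL: table options are bare KEY=VALUE pairs, no parentheses,
    // no spaces around '='.
    static String mysqlAlter(String table, Map<String, String> opts) {
        String body = new TreeMap<>(opts).entrySet().stream()
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(", "));
        return "ALTER TABLE `" + table + "` " + body + ";";
    }
}
```

The `TreeMap` emits options in sorted key order, which matches the alphabetized previews in the report output above.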

Architecture

```
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/
├── Action.java                 # APPLY / TIGHTEN / RELAX / OK / SKIP
├── TableStats.java             # input record (rows, bytes, current settings)
├── TableRecommendation.java    # output record (action, current, recommended, reason)
├── ServerParamCheck.java       # per-parameter compliance row
├── DbTuneResult.java           # bundle: engine, version, params, recommendations
├── AutoTuner.java              # interface: analyze / recommend / apply / analyzeOne
├── PostgresTuningCatalog.java  # tier map: HOT_RELATIONSHIP, HOT_TAG_USAGE, ENTITY_LARGE, ENTITY_SERVICE, APPEND_ONLY
├── PostgresAutoTuner.java      # autovacuum reloptions + ANALYZE
├── MysqlTuningCatalog.java     # tier map: HOT, ENTITY_LARGE, ENTITY_SERVICE
├── MysqlAutoTuner.java         # InnoDB STATS_* + ANALYZE TABLE
└── DbTuneReport.java           # ASCII renderer + ALTER-statement preview
```

`recommend(stats)` is a pure function — input is observed table state, output is the action. Tested without a database. `apply()` and `analyzeOne()` are the I/O methods, tested in `DbTuneIT` against a live Testcontainer.
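A minimal sketch of that purity, with hypothetical names (the real `recommend(stats)` also distinguishes TIGHTEN and RELAX):

```java
import java.util.Map;

public class RecommendSketch {
    enum Action { APPLY, OK, SKIP }

    // Illustrative reduction of the heuristic: gate on row count,
    // report OK when every recommended option is already set, else APPLY.
    // No I/O anywhere, so this is trivially unit-testable.
    static Action recommend(long rows, long rowThreshold,
                            Map<String, String> current,
                            Map<String, String> recommended) {
        if (rows < rowThreshold) {
            return Action.SKIP; // small dev table: not worth tuning
        }
        if (current.entrySet().containsAll(recommended.entrySet())) {
            return Action.OK;   // already tightened: nothing to do
        }
        return Action.APPLY;
    }
}
```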

Heuristic

| Tier | Tables | Recommended (Postgres) | Recommended (MySQL) | Row threshold |
|------|--------|------------------------|---------------------|---------------|
| HOT | entity_relationship, tag_usage | scale=0.005/0.01, cost_limit=4000 (+cost_delay=0 for tag_usage) | STATS_SAMPLE_PAGES=100 | 0 |
| ENTITY_LARGE | 22 entity tables | scale=0.01/0.02 | STATS_SAMPLE_PAGES=64 | 10,000 |
| ENTITY_SERVICE | database_entity, database_schema_entity | scale=0.02/0.05 | STATS_SAMPLE_PAGES=32 | 5,000 |
| APPEND_ONLY | change_event | scale=0.1/0.2 (RELAX) | (n/a — InnoDB has no relax knob) | 50,000 |

Tests

  • 38 unit tests, all green: `PostgresAutoTunerTest` (20), `MysqlAutoTunerTest` (11), `DbTuneReportTest` (7).
  • `DbTuneIT` runs against the IT bootstrap's Testcontainer (Postgres by default; `-DdatabaseType=mysql` for MySQL): asserts `analyze()` returns recommendations, `apply()` writes reloptions and is idempotent, `analyzeOne()` runs without error, dry-run analyze does not mutate settings. Resets the modified table options in `finally`.

Test plan

  • CI: unit tests pass (`mvn test -Dtest='PostgresAutoTunerTest,MysqlAutoTunerTest,DbTuneReportTest'`)
  • CI: `DbTuneIT` passes against the Postgres profile
  • CI: `DbTuneIT` passes against the MySQL profile (`-DdatabaseType=mysql`)
  • Local: `./openmetadata-ops.sh db-tune` against a populated dev DB — verify report shape + sane recommendations
  • Local: `./openmetadata-ops.sh db-tune --apply --yes` followed by a second `db-tune` shows everything as OK (idempotency)
  • Local: `./openmetadata-ops.sh db-tune --apply --yes --analyze` completes without error and refreshes planner stats

🤖 Generated with Claude Code


Summary by Gitar

  • Diagnostic refinements:
    • Improved PostgresDiagnostic output by implementing nullSafe formatting for SQL result sets and standardizing decimal precision for scan ratios.
    • Updated interactive CLI prompt in OpenMetadataOperations to use nextLine() for better Enter key handling.
    • Added unit tests in DiagnosticReportTest to verify null-safe string handling and ratio calculation logic.


Generate a per-table autovacuum (Postgres) / InnoDB stats (MySQL)
tuning report from the live database; --apply executes the ALTER
statements after operator confirmation; --analyze refreshes planner
stats on the changed tables. Heuristic skips small dev tables
(row-count gated) and recognizes already-tightened settings as OK.

Catalog values come from the 600k-container production audit. The
companion runbook in docs/rds-postgres-production-runbook.md captures
the parameter-group settings, extensions, and the indexes that must be
applied by hand because the migration runner can't dedup cross-version.

Tests: 38 unit tests on the heuristic + report formatter; DbTuneIT
covers analyze / apply / analyzeOne / dry-run against the IT
bootstrap's live Postgres or MySQL Testcontainer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 18:55
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels May 4, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Adds an operator-facing db-tune ops subcommand plus a production RDS Postgres runbook, backed by a new dbtune Java utility package that can analyze table stats, produce tuning recommendations, and (optionally) apply per-table settings and refresh planner stats for Postgres/MySQL.

Changes:

  • Introduces org.openmetadata.service.util.dbtune (catalogs, engine-specific tuners, report renderer, result models).
  • Wires a new db-tune subcommand into OpenMetadataOperations with --apply/--yes/--analyze.
  • Adds unit tests for heuristic/report formatting and an integration test (DbTuneIT), plus an RDS Postgres production runbook.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
|------|-------------|
| openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java | Adds db-tune CLI command, prompts, and apply/analyze execution flow |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/Action.java | Defines actionable vs non-actionable recommendation actions |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/AutoTuner.java | Defines the engine-specific tuner contract (analyze/recommend/apply/analyzeOne/sql builder) |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/DbTuneReport.java | Renders ASCII report output and ALTER statement previews |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/DbTuneResult.java | Bundles engine/version + server param checks + per-table recommendations |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/MysqlAutoTuner.java | Implements MySQL stats/options inspection, recommendations, and ALTER/ANALYZE generation |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/MysqlTuningCatalog.java | Maps MySQL tables to stats sampling “profiles” and thresholds |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/PostgresAutoTuner.java | Implements Postgres reloptions inspection, recommendations, and ALTER/ANALYZE generation |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/PostgresTuningCatalog.java | Maps Postgres tables to autovacuum “profiles” and thresholds |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/ServerParamCheck.java | Models server-level parameter compliance rows |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/TableRecommendation.java | Models per-table recommendation outputs |
| openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/TableStats.java | Models per-table observed stats/settings inputs |
| openmetadata-service/src/test/java/org/openmetadata/service/util/dbtune/DbTuneReportTest.java | Unit tests for report formatting helpers and sections |
| openmetadata-service/src/test/java/org/openmetadata/service/util/dbtune/MysqlAutoTunerTest.java | Unit tests for MySQL heuristic decisions, parsing, and SQL formatting |
| openmetadata-service/src/test/java/org/openmetadata/service/util/dbtune/PostgresAutoTunerTest.java | Unit tests for Postgres heuristic decisions, parsing, quoting, and SQL formatting |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/DbTuneIT.java | Integration test exercising analyze/apply/analyze-one against Testcontainers DB |
| docs/rds-postgres-production-runbook.md | Consolidated ops runbook for RDS Postgres parameter groups, extensions, table overrides, and manual indexes |

Comment on lines +33 to +41
/**
* End-to-end tests for {@link AutoTuner} against the live Testcontainers database. The bootstrap
* runs every migration up to the current version, so the tracked entity tables (e.g.
* {@code storage_container_entity}) exist; we exercise the analyze → apply → analyze-one path
* against the real schema and reset the modified reloptions / table options at the end.
*
* <p>Sequential because each test mutates table-level reloptions on shared production tables;
* parallel execution would race between read-stats and apply.
*/
@harshach (Collaborator, Author) replied:

Fixed in a719d47: added `@Execution(ExecutionMode.SAME_THREAD)` on the class so the tests don't race on shared reloptions when the suite runs in parallel.


github-actions Bot commented May 4, 2026

🔴 Playwright Results — 1 failure(s), 15 flaky

✅ 3988 passed · ❌ 1 failed · 🟡 15 flaky · ⏭️ 86 skipped

| Shard | Passed | Failed | Flaky | Skipped |
|-------|--------|--------|-------|---------|
| 🟡 Shard 1 | 298 | 0 | 1 | 4 |
| 🔴 Shard 2 | 746 | 1 | 3 | 8 |
| 🟡 Shard 3 | 751 | 0 | 4 | 7 |
| 🟡 Shard 4 | 773 | 0 | 2 | 18 |
| 🟡 Shard 5 | 685 | 0 | 2 | 41 |
| 🟡 Shard 6 | 735 | 0 | 3 | 8 |

Genuine Failures (failed on all attempts)

Features/Container.spec.ts › Container page children pagination (shard 2)
Test timeout of 180000ms exceeded.
🟡 15 flaky test(s) (passed on retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataProductPersonaCustomization.spec.ts › Data Product - customize tab label should only render if it's customized by user (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/AddRoleAndAssignToUser.spec.ts › Verify assigned role to new user (shard 3, 1 retry)
  • Flow/PersonaDeletionUserProfile.spec.ts › User profile loads correctly before and after persona deletion (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataContractInheritance.spec.ts › Partial Contract Inheritance - Asset contract merges with Data Product contract (shard 4, 1 retry)
  • Pages/DomainUIInteractions.spec.ts › Delete subdomain via UI (shard 4, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Add, Update and Verify Data Glossary Term (shard 6, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings May 5, 2026 17:22
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

…nHeavy

Two findings from the review round on 2d9f047:

1. DbTuneIT.applyExecutesAndIsIdempotent stopped verifying that apply()
   actually persisted the settings to the DB after the rewrite to an
   isolated test table — it only asserted the ALTER statement
   contained the table name and that apply() didn't throw. The earlier
   shape (assertSettingsMatch) was load-bearing for catching regressions
   where apply() silently no-ops. Restored end-to-end verification.

   Added AutoTuner.currentSettingsForTable(Handle, String) so the test
   can read back settings on a table that is NOT in the static catalog
   (the isolated dbtune_it_isolated_table). Implemented for both
   engines by reusing the existing parseReloptions / parseCreateOptions
   logic. Falls out cleanly from a single pg_class / INFORMATION_SCHEMA
   query; no duplication.

   The IT now: builds the recommendation, asserts the generated SQL is
   well-formed, applies it, reads back via currentSettingsForTable,
   asserts each recommended key/value persisted numerically (handles
   PG lowercase / MySQL uppercase key conventions via case-insensitive
   lookup), then re-applies and asserts the snapshot is byte-equal.

2. PostgresDiagnostic.seqScanHeavy WHERE clause used
   seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio, which silently
   excludes tables with idx_scan=0 — exactly the worst-case candidates
   for a missing index. The Java mapper even has a dead-code branch
   handling idx_scan=0 ("∞" ratio display) that could never trigger.

   Changed predicate to (idx_scan = 0 OR seq_scan::numeric/idx_scan > :ratio).
   The mapper's ∞ branch now actually executes. Inline comment added
   explaining the choice so a future audit doesn't undo it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 11, 2026 14:20
@harshach

Round 3 review addressed in 4951a04.

| Finding | Fix |
|---------|-----|
| `applyExecutesAndIsIdempotent` stopped verifying DB state after the isolated-table rewrite | Restored end-to-end verification. Added `AutoTuner.currentSettingsForTable(Handle, String)` so the test can read back settings on a table that's not in the static catalog (`dbtune_it_isolated_table`). The IT now applies, reads back via the new method, asserts each recommended key/value persisted numerically (handles PG lowercase / MySQL uppercase via case-insensitive lookup), then re-applies and asserts the snapshot is byte-equal. The earlier `assertSettingsMatch` shape is restored end-to-end against the isolated table — no risk to shared production tables. |
| `seqScanHeavy` SQL silently excluded `idx_scan=0` tables — exactly the worst-case missing-index candidates | Changed the WHERE clause from `seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio` to `(idx_scan = 0 OR seq_scan::numeric/idx_scan > :ratio)`. The Java mapper's `idx_scan == 0 → "∞"` branch (which was unreachable before) now actually runs. Added an inline comment so a future audit doesn't undo the predicate. |
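The numeric, case-insensitive read-back check described above could look roughly like this sketch (class and method names are illustrative, not the PR's code):

```java
import java.util.Map;
import java.util.TreeMap;

public class SettingsReadback {
    // Compare recommended options against what was read back from the DB.
    // Keys compare case-insensitively (Postgres reloptions come back
    // lowercase, MySQL CREATE_OPTIONS uppercase); values compare
    // numerically so "0.010" matches "0.01". Assumes numeric option
    // values, which holds for the autovacuum / STATS_* options in play.
    static boolean settingsPersisted(Map<String, String> recommended,
                                     Map<String, String> readBack) {
        Map<String, String> ci = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        ci.putAll(readBack);
        for (Map.Entry<String, String> want : recommended.entrySet()) {
            String got = ci.get(want.getKey());
            if (got == null
                || Double.parseDouble(got) != Double.parseDouble(want.getValue())) {
                return false;
            }
        }
        return true;
    }
}
```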

Unit tests stay at 50 passing. Both modules build clean.

🤖 Generated with Claude Code

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

@SuppressWarnings("resource")
Scanner scanner = new Scanner(System.in);
LOG.info("Apply now? [y/N]: ");
String input = scanner.hasNext() ? scanner.next().trim().toLowerCase() : "";
@harshach (Collaborator, Author) replied:

Fixed in 1928f2f: switched to scanner.nextLine() and treat blank input (and EOF) as 'no', matching the [y/N] convention. Bare Enter now dismisses the prompt cleanly.
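A sketch of the fixed prompt flow (simplified and self-contained; the real code lives in `OpenMetadataOperations` and logs via `LOG.info`):

```java
import java.util.Locale;
import java.util.Scanner;

public class ConfirmPrompt {
    // nextLine() consumes a bare Enter immediately, unlike token-based
    // next(), which would keep waiting for non-whitespace input.
    // Blank input and EOF both fall through to the [y/N] default: no.
    static boolean confirmApply(Scanner scanner) {
        System.out.print("Apply now? [y/N]: ");
        if (!scanner.hasNextLine()) {
            return false; // EOF (e.g. piped stdin ran dry) counts as "no"
        }
        String input = scanner.nextLine().trim().toLowerCase(Locale.ROOT);
        return input.equals("y") || input.equals("yes");
    }
}
```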

…ders blank, ratio uses double

Four findings from the round on 4951a04:

1. confirmApply prompt blocked on bare Enter
   scanner.hasNext()/next() are token-based, so a user pressing Enter to
   accept the [y/N] default would hang waiting for a non-whitespace
   token. Switched to scanner.nextLine() and treat empty input (and EOF)
   as "no" — matches the [y/N] convention.

2. & 3. NULL timestamps rendered as literal "null"
   highDeadTuples.last_vacuum and staleStats.last_analyzed are nullable
   columns; String.valueOf(rs.getString(...)) maps SQL NULL to the four-
   character string "null", which the renderer then prints verbatim in
   the diagnostic table. Added nullSafe() helper that returns "" for
   null and replaced both call sites.

4. seqScanHeavy ratio used integer division
   seq_scan / idx_scan as longs reports 15/2 → 7, masking how
   seq-scan-heavy a table actually is. Extracted formatSeqIdxRatio()
   that does (double) seqScan / idxScan formatted to one decimal, and
   still returns "∞" for the idx_scan=0 worst case. The previous fix to
   include idx_scan=0 in the SQL (4951a04) means the "∞" branch now
   actually fires.

Tests: 53 passing (+3 for the new helpers — nullSafe + formatSeqIdxRatio
across normal, zero, and one-decimal cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshach

Round 4 review (4264516128) — addressed all four in 1928f2f.

| # | Finding | Fix |
|---|---------|-----|
| 1 | confirmApply prompt blocked on bare Enter (`scanner.next()` is token-based) | Switched to `scanner.nextLine()`; blank input and EOF both map to "no". Bare Enter now dismisses the [y/N] prompt as expected. |
| 2 | `highDeadTuples.last_vacuum` rendered SQL NULL as literal "null" | Added `nullSafe()` helper that returns "" for null. Applied. |
| 3 | `staleStats.last_analyzed` same issue | Same helper applied. Tables that have never been analyzed now show an empty cell instead of the string "null". |
| 4 | `seqScanHeavy.ratio` used integer division (`seq_scan / idx_scan` as longs) — 15/2 reported as 7 | Extracted `formatSeqIdxRatio(seqScan, idxScan)` using `(double)` division formatted to one decimal. `idx_scan=0` keeps the "∞" rendering, which now actually fires for the worst-case missing-index findings (since the SQL predicate stopped excluding them in 4951a04). |

53 unit tests passing (was 50, +3 for the new helpers: nullSafe_returnsEmptyForNullAndUntouchedForNonNull, formatSeqIdxRatio_usesDoubleDivisionAndOneDecimal, formatSeqIdxRatio_zeroIdxScansRendersInfinity).

🤖 Generated with Claude Code
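The two helpers described in this round could be sketched like this (signatures are assumptions based on the descriptions above, not the PR's exact code):

```java
import java.util.Locale;

public class DiagnosticHelpers {
    // SQL NULL surfaces as a null String from the result set; render it
    // as an empty cell rather than the literal four-character "null".
    static String nullSafe(String value) {
        return value == null ? "" : value;
    }

    // Double division to one decimal place; idx_scan = 0 is the
    // worst-case missing-index signal and renders as the infinity glyph.
    static String formatSeqIdxRatio(long seqScan, long idxScan) {
        if (idxScan == 0) {
            return "\u221E"; // "∞"
        }
        return String.format(Locale.ROOT, "%.1f", (double) seqScan / idxScan);
    }
}
```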

@gitar-bot

gitar-bot Bot commented May 11, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Introduces a db-tune subcommand for automated Postgres and MySQL database performance recommendations, including interactive apply and analyze workflows. Resolved an NPE in diagnostic output, improved prompt handling, and refined scan ratio calculations.

✅ 3 resolved
Edge Case: NPE in catch block when exception message is null

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java:2570
List.of() does not accept null elements. If the caught exception has a null message (e.g., a raw NullPointerException or IllegalStateException without a message string), e.getMessage() returns null and List.of(...) throws a secondary NullPointerException, swallowing the original error and producing an unhelpful stack trace.

Quality: applyExecutesAndIsIdempotent no longer verifies DB state

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/DbTuneIT.java:117-123
The old test used assertSettingsMatch to confirm settings were actually persisted in the database after apply(). The new test only verifies that buildAlterStatement returns a string containing the table name and that apply() doesn't throw. While the rationale for using an isolated table is solid (avoiding MySQL metadata invalidation), the test could still read back the settings from the isolated table to confirm they took effect — there's no risk of side-effects on shared tables since ISOLATED_TABLE is private to this test.

This would restore the end-to-end confidence that apply() actually mutates the DB, rather than just not crashing.

Edge Case: seqScanHeavy excludes worst-case tables (idx_scan=0)

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/PostgresDiagnostic.java:183-197
The SQL WHERE clause uses seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio. When idx_scan = 0, NULLIF returns NULL making the whole comparison NULL (falsy), so those rows are excluded from results. Ironically, tables with many sequential scans and zero index scans are the most concerning candidates for a missing index, yet they'll never appear in findings.

The Java mapper at line 204 has a dead-code branch handling idx_scan == 0 (returns "∞") that can never execute given the query filter.



@harshach harshach merged commit 2859cbe into main May 11, 2026
53 of 58 checks passed
@harshach harshach deleted the harshach/rds-perf-doc branch May 11, 2026 21:05
harshach added a commit that referenced this pull request May 11, 2026
* Add db-tune ops command + production RDS runbook

Generate a per-table autovacuum (Postgres) / InnoDB stats (MySQL)
tuning report from the live database; --apply executes the ALTER
statements after operator confirmation; --analyze refreshes planner
stats on the changed tables. Heuristic skips small dev tables
(row-count gated) and recognizes already-tightened settings as OK.

Catalog values come from the 600k-container production audit. The
companion runbook in docs/rds-postgres-production-runbook.md captures
the parameter-group settings, extensions, and the indexes that must be
applied by hand because the migration runner can't dedup cross-version.

Tests: 38 unit tests on the heuristic + report formatter; DbTuneIT
covers analyze / apply / analyzeOne / dry-run against the IT
bootstrap's live Postgres or MySQL Testcontainer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove file

* db-tune: address review 4230257768 — direction-agnostic status, empty/match disambiguation, script path

- ServerParamCheck.STATUS_UNDERSIZED -> STATUS_MISMATCH. Some recommended
  values are deliberately lower than engine defaults (random_page_cost
  1.1 vs 4.0, autovacuum scale factors). Labelling those as "UNDERSIZED"
  was wrong. Added a regression test that pins random_page_cost as
  MISMATCH when the current value is higher than recommended.

- dbTune --apply: distinguish "no tracked tables on this DB" from "all
  tables already match" in the Nothing-to-apply log line. Mirrors the
  same disambiguation already in DbTuneReport.appendNextSteps.

- DbTuneReport next-steps now points at ./bootstrap/openmetadata-ops.sh
  (the actual location of the wrapper script) instead of the bare
  ./openmetadata-ops.sh which would be file-not-found from the repo root.

- DbTuneIT analyzeReturnsRecommendationsForKnownTables: assertion
  message now references TEST_TABLE so it stays correct after the
  switch from storage_container_entity to entity_relationship.

Tests: 40 unit tests passing (+1 for the higher-than-recommended MISMATCH case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* DbTuneIT: contain apply path to a private isolated table

The earlier IT had applyChangesReloptionsAndIsIdempotent + analyzeOne
running ALTER TABLE / ANALYZE TABLE against entity_relationship — a
real catalog table that the rest of the suite writes to. On MySQL this
broke ~17 downstream ITs (GlossaryResourceIT, GlossaryTermRelationsIT,
DomainResourceIT, LineageBrokenReferenceIT, OpenLineageLineageResolutionIT)
with "Cannot create a JSON value from a string with CHARACTER SET 'binary'"
on inserts into entity_relationship.json.

Root cause: ALTER TABLE bumps MySQL's per-table metadata version, which
invalidates JDBC prepared-statement caches across the whole shared
Testcontainer. After re-prepare, the JSON column metadata sometimes
comes back as VARBINARY in Connector/J's cache, breaking subsequent
JSON inserts. Confirmed via CI log timing: DbTuneIT finished at
17:58:12, GlossaryTermRelationsIT failed starting 17:58:53.

The recommendations themselves are correct — STATS_PERSISTENT=1,
STATS_AUTO_RECALC=1, STATS_SAMPLE_PAGES=100 are safe and modern InnoDB
defaults except for SAMPLE_PAGES. The test just cannot afford the side
effect on a shared DB.

Fix: introduce dbtune_it_isolated_table created/dropped per test in
@BeforeEach/@AfterEach, build a TableRecommendation for it directly
(bypassing the static catalog), and run apply + analyzeOne against it.
The read-only tests (analyze, dry-run) continue to use the real catalog
since they don't mutate anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* db-tune: add --diagnose for read-only DBA findings

The tuning report we already had is a static playbook (table → recipe + row-count gate). Operators rightly asked whether db-tune actually inspects the live database for *its own* signals — unused indexes, bloat, slow queries, cache hit ratios — and surfaces recommendations from those measurements. Until now: no.

This adds a parallel read-only path:

  ./bootstrap/openmetadata-ops.sh db-tune --diagnose

Postgres categories (each isolated in its own try block; missing extension or permissions surfaces in DbTuneDiagnosis.notes rather than failing the run):

- UNUSED_INDEX — pg_stat_user_indexes idx_scan=0, size > 10 MB, non-unique non-pkey
- HIGH_DEAD_TUPLES — n_dead_tup/n_live_tup > 0.2 with n_live_tup > 10k (autovacuum falling behind)
- LOW_CACHE_HIT — heap_blks_read > 1000 AND hit ratio < 90%
- STALE_STATS — last_autoanalyze NULL or > 14 days, n_live_tup > 1000
- SEQ_SCAN_HEAVY — seq_scan/idx_scan > 10 with > 1000 seq scans (suggests missing index)
- SLOW_QUERY — pg_stat_statements top 10 by mean_exec_time, calls > 100 (gracefully skipped if extension absent)
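The UNUSED_INDEX criteria above translate to a join between pg_stat_user_indexes and pg_index; the shape below is an illustrative sketch (class name and exact query text are ours, not the literal statement in PostgresDiagnostic):

```java
// Sketch of the unused-index filter as SQL: never scanned, over 10 MB,
// and neither unique nor a primary key (those serve constraints, not reads).
public final class UnusedIndexQuery {
    public static final String SQL = """
        SELECT s.schemaname, s.relname, s.indexrelname,
               pg_relation_size(s.indexrelid) AS index_bytes
        FROM pg_stat_user_indexes s
        JOIN pg_index i ON i.indexrelid = s.indexrelid
        WHERE s.idx_scan = 0
          AND pg_relation_size(s.indexrelid) > 10 * 1024 * 1024
          AND NOT i.indisunique
          AND NOT i.indisprimary
        """;
}
```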

MySQL categories:

- UNUSED_INDEX — sys.schema_unused_indexes filtered to current schema
- LOW_BUFFER_POOL_HIT — Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests < 99%
- SLOW_QUERY — performance_schema.events_statements_summary_by_digest top 10 by avg_timer_wait
- FULL_TABLE_SCAN — sys.statements_with_full_table_scans
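The LOW_BUFFER_POOL_HIT check reduces to one ratio over two status counters (Innodb_buffer_pool_reads = reads that missed the pool and went to disk; Innodb_buffer_pool_read_requests = all logical reads). A sketch of the math, with method names of our own invention:

```java
// Sketch: MySQL buffer pool hit ratio from the two InnoDB status counters.
// A LOW_BUFFER_POOL_HIT finding fires when the ratio drops below 99%.
public final class BufferPoolHit {

    public static double hitRatio(long reads, long readRequests) {
        if (readRequests == 0) {
            return 1.0; // no logical reads yet: nothing to flag
        }
        return 1.0 - (double) reads / readRequests;
    }

    public static boolean low(long reads, long readRequests) {
        return hitRatio(reads, readRequests) < 0.99;
    }
}
```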

Concept stays separate from AutoTuner: Diagnostic is read-only and never participates in --apply. Categories with zero findings are suppressed in the report; notes capture what couldn't be checked. Each finding is structured as (category, severity, attributes) with the attribute keys driven by DiagnosticCategory.columns() so the renderer dispatches a category-specific layout.

Tests: 50 unit tests passing (40 → 50, +10 in DiagnosticReportTest covering EnumMap grouping order, empty/non-empty rendering, suppression, notes appending, query truncation, and column-list immutability). DbTuneIT gains diagnoseCompletesWithoutErrorAndReturnsStructuredResult that exercises the full end-to-end against the live Testcontainer (read-only, safe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* db-tune: restore apply DB verification + include idx_scan=0 in seqScanHeavy

Two findings from the review round on 2d9f047:

1. DbTuneIT.applyExecutesAndIsIdempotent stopped verifying that apply()
   actually persisted the settings to the DB after the rewrite to an
   isolated test table — it only asserted the ALTER statement
   contained the table name and that apply() didn't throw. The earlier
   shape (assertSettingsMatch) was load-bearing for catching regressions
   where apply() silently no-ops. Restored end-to-end verification.

   Added AutoTuner.currentSettingsForTable(Handle, String) so the test
   can read back settings on a table that is NOT in the static catalog
   (the isolated dbtune_it_isolated_table). Implemented for both
   engines by reusing the existing parseReloptions / parseCreateOptions
   logic. Falls out cleanly from a single pg_class / INFORMATION_SCHEMA
   query; no duplication.

   The IT now: builds the recommendation, asserts the generated SQL is
   well-formed, applies it, reads back via currentSettingsForTable,
   asserts each recommended key/value persisted numerically (handles
   PG lowercase / MySQL uppercase key conventions via case-insensitive
   lookup), then re-applies and asserts the snapshot is byte-equal.
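   The case-insensitive, numeric comparison can be sketched as follows
   (class and method names here are illustrative, not the actual
   AutoTuner/DbTuneIT API):

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Sketch: compare recommended reloptions against what the engine reports,
// ignoring key case (PG stores lowercase keys such as
// autovacuum_vacuum_scale_factor; MySQL reports uppercase like STATS_PERSISTENT).
public final class SettingsCompare {

    // Wrap the read-back settings in a map whose lookups ignore case.
    public static Map<String, String> caseInsensitive(Map<String, String> raw) {
        Map<String, String> m = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        m.putAll(raw);
        return m;
    }

    // A recommended key/value "persisted numerically" if the stored value
    // parses to the same double (so "1" matches "1.0").
    public static boolean persisted(Map<String, String> current, String key, String want) {
        String got = caseInsensitive(current).get(key);
        return got != null && Double.parseDouble(got) == Double.parseDouble(want);
    }
}
```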

2. PostgresDiagnostic.seqScanHeavy WHERE clause used
   seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio, which silently
   excludes tables with idx_scan=0 — exactly the worst-case candidates
   for a missing index. The Java mapper even has a dead-code branch
   handling idx_scan=0 ("∞" ratio display) that could never trigger.

   Changed predicate to (idx_scan = 0 OR seq_scan::numeric/idx_scan > :ratio).
   The mapper's ∞ branch now actually executes. Inline comment added
   explaining the choice so a future audit doesn't undo it.
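   The shape of the old and corrected predicates, captured as string
   constants for illustration (not the literal query text in
   PostgresDiagnostic):

```java
// Sketch of the seq-scan-heavy filter. With NULLIF, the old predicate made
// the whole comparison NULL for idx_scan = 0 rows, and WHERE treats NULL as
// false -- silently dropping the tables that have never used an index at all.
public final class SeqScanPredicate {

    // Old, buggy shape: excludes idx_scan = 0.
    public static final String OLD =
        "seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio";

    // Fixed shape: idx_scan = 0 is always a finding.
    public static final String FIXED =
        "(idx_scan = 0 OR seq_scan::numeric / idx_scan > :ratio)";
}
```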

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* db-tune: address review 4264516128 — Enter dismisses prompt, NULL renders blank, ratio uses double

Four findings from the round on 4951a04:

1. confirmApply prompt blocked on bare Enter
   scanner.hasNext()/next() are token-based, so a user pressing Enter to
   accept the [y/N] default would hang waiting for a non-whitespace
   token. Switched to scanner.nextLine() and treat empty input (and EOF)
   as "no" — matches the [y/N] convention.

2. & 3. NULL timestamps rendered as literal "null"
   highDeadTuples.last_vacuum and staleStats.last_analyzed are nullable
   columns; String.valueOf(rs.getString(...)) maps SQL NULL to the four-
   character string "null", which the renderer then prints verbatim in
   the diagnostic table. Added nullSafe() helper that returns "" for
   null and replaced both call sites.
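   A minimal sketch of that helper, assuming the behavior described above
   (the enclosing class name is ours):

```java
// Sketch: map SQL NULL (surfaced as a null String from ResultSet.getString)
// to an empty table cell instead of the literal four-character text "null".
public final class Render {
    public static String nullSafe(String value) {
        return value == null ? "" : value;
    }
}
```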

4. seqScanHeavy ratio used integer division
   seq_scan / idx_scan as longs reports 15/2 → 7, masking how
   seq-scan-heavy a table actually is. Extracted formatSeqIdxRatio()
   that does (double) seqScan / idxScan formatted to one decimal, and
   still returns "∞" for the idx_scan=0 worst case. The previous fix to
   include idx_scan=0 in the SQL (4951a04) means the "∞" branch now
   actually fires.
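   A sketch of that formatter, with the method body assumed from the
   description (the real one lives alongside the diagnostic renderer):

```java
import java.util.Locale;

// Sketch: floating-point seq_scan/idx_scan ratio formatted to one decimal,
// with the idx_scan = 0 worst case rendered as the infinity symbol.
public final class SeqIdxRatio {
    public static String formatSeqIdxRatio(long seqScan, long idxScan) {
        if (idxScan == 0) {
            return "\u221e"; // never used an index: infinite ratio
        }
        return String.format(Locale.ROOT, "%.1f", (double) seqScan / idxScan);
    }
}
```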

Tests: 53 passing (+3 for the new helpers — nullSafe + formatSeqIdxRatio
across normal, zero, and one-decimal cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>