Skip to content

Add db-tune ops subcommand + production RDS runbook#27890

Open
harshach wants to merge 6 commits intomainfrom
harshach/rds-perf-doc
Open

Add db-tune ops subcommand + production RDS runbook#27890
harshach wants to merge 6 commits intomainfrom
harshach/rds-perf-doc

Conversation

@harshach
Copy link
Copy Markdown
Collaborator

@harshach harshach commented May 4, 2026

Summary

  • New `./openmetadata-ops.sh db-tune` subcommand: read-only tuning report by default, with `--apply` to execute and `--analyze` to refresh planner stats. Postgres autovacuum reloptions; MySQL InnoDB persistent-stats reloptions. Pure heuristic + formatter is unit-tested; live-DB flow is covered by `DbTuneIT` against the IT bootstrap's Postgres / MySQL Testcontainer.

Sample input / output

Report mode (default — no DB writes)

```
$ ./openmetadata-ops.sh db-tune

Database engine: PostgreSQL 17.2

=== Server-level parameter compliance ===
+---------------------------------+---------+-------------------+-------------+-------------------------+
| Parameter | Current | Recommended | Status | Note |
+---------------------------------+---------+-------------------+-------------+-------------------------+
| shared_buffers | 16384 | 40% of RAM | UNTUNED | RAM-relative; verify |
| effective_cache_size | 524288 | 75% of RAM | UNTUNED | RAM-relative; verify |
| work_mem | 4096 | 131072 | UNDERSIZED | |
| maintenance_work_mem | 65536 | 2097152 | UNDERSIZED | |
| random_page_cost | 4 | 1.1 | UNDERSIZED | |
| effective_io_concurrency | 1 | 200 | UNDERSIZED | |
| max_parallel_workers_per_gather | 2 | 4 | UNDERSIZED | |
| autovacuum_naptime | 60 | 15 | UNDERSIZED | |
| autovacuum_vacuum_scale_factor | 0.2 | 0.05 | UNDERSIZED | |
| autovacuum_analyze_scale_factor | 0.1 | 0.02 | UNDERSIZED | |
+---------------------------------+---------+-------------------+-------------+-------------------------+

These cannot be applied by this tool — change them in your DB parameter group / RDS console.

=== Per-table recommendations (26 tables) ===
+--------------------------+-----------+--------+-----------+----------------------------------------+---------+----------------------------------+
| Table | Rows | Size | Current | Recommended | Action | Reason |
+--------------------------+-----------+--------+-----------+----------------------------------------+---------+----------------------------------+
| entity_relationship | 8,400,000 | 3.2 GB | (default) | autovacuum_analyze_scale_factor=0.005, | APPLY | Join target, write-heavy |
| | | | | autovacuum_vacuum_cost_limit=4000, | | |
| | | | | autovacuum_vacuum_scale_factor=0.01 | | |
| tag_usage | 7,402,343 | 30 GB | autovac.. | autovacuum_analyze_scale_factor=0.005, | TIGHTEN | Hottest table on read path |
| | | | | autovacuum_vacuum_cost_delay=0, | | |
| | | | | autovacuum_vacuum_cost_limit=4000, | | |
| | | | | autovacuum_vacuum_scale_factor=0.01 | | |
| storage_container_entity | 580,123 | 1.5 GB | (default) | autovacuum_analyze_scale_factor=0.01, | APPLY | Large entity table; tighten |
| | | | | autovacuum_vacuum_scale_factor=0.02 | | autovacuum so list count stats |
| | | | | | | stay fresh |
| table_entity | 420,000 | 2.1 GB | (default) | autovacuum_analyze_scale_factor=0.01, | APPLY | Large entity table |
| | | | | autovacuum_vacuum_scale_factor=0.02 | | |
| change_event | 12,400,000| 4.0 GB | (default) | autovacuum_analyze_scale_factor=0.1, | RELAX | Append-only, relax autovacuum |
| | | | | autovacuum_vacuum_scale_factor=0.2 | | |
| dashboard_entity | 8,500 | 32 MB | (default) | (default) | SKIP | Row count 8500 below threshold |
| | | | | | | 10000 |
| api_collection_entity | 0 | 0 KB | (default) | (default) | SKIP | Row count 0 below threshold |
| | | | | | | 10000 |
+--------------------------+-----------+--------+-----------+----------------------------------------+---------+----------------------------------+

Next steps:
./openmetadata-ops.sh db-tune --apply --analyze # apply + refresh planner stats
./openmetadata-ops.sh db-tune --apply # apply only; run analyze-tables later
```

Apply mode (interactive prompt)

```
$ ./openmetadata-ops.sh db-tune --apply

[ ... same report as above ... ]

About to apply 5 ALTER statements:
ALTER TABLE "entity_relationship" SET (autovacuum_analyze_scale_factor = 0.005, autovacuum_vacuum_cost_limit = 4000, autovacuum_vacuum_scale_factor = 0.01);
ALTER TABLE "tag_usage" SET (autovacuum_analyze_scale_factor = 0.005, autovacuum_vacuum_cost_delay = 0, autovacuum_vacuum_cost_limit = 4000, autovacuum_vacuum_scale_factor = 0.01);
ALTER TABLE "storage_container_entity" SET (autovacuum_analyze_scale_factor = 0.01, autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE "table_entity" SET (autovacuum_analyze_scale_factor = 0.01, autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE "change_event" SET (autovacuum_analyze_scale_factor = 0.1, autovacuum_vacuum_scale_factor = 0.2);

Apply now? [y/N]: y

+--------------------------+---------+--------+----------+
| Table | Action | Status | Details |
+--------------------------+---------+--------+----------+
| entity_relationship | APPLY | OK | Applied |
| tag_usage | TIGHTEN | OK | Applied |
| storage_container_entity | APPLY | OK | Applied |
| table_entity | APPLY | OK | Applied |
| change_event | RELAX | OK | Applied |
+--------------------------+---------+--------+----------+
```

Apply + refresh planner stats

```
$ ./openmetadata-ops.sh db-tune --apply --yes --analyze

[ ... report ... ]

+--------------------------+---------+--------+--------------------+
| Table | Action | Status | Details |
+--------------------------+---------+--------+--------------------+
| entity_relationship | APPLY | OK | Applied + analyzed |
| tag_usage | TIGHTEN | OK | Applied + analyzed |
| storage_container_entity | APPLY | OK | Applied + analyzed |
| ... | ... | ... | ... |
+--------------------------+---------+--------+--------------------+
```

MySQL output

```
$ ./openmetadata-ops.sh db-tune

Database engine: MySQL 8.0.36

=== Server-level parameter compliance ===
+--------------------------------------+----------+-----------------------+-------------+
| Parameter | Current | Recommended | Status |
+--------------------------------------+----------+-----------------------+-------------+
| innodb_buffer_pool_size | 1.0 GB | 40-60% of RAM | UNTUNED |
| innodb_io_capacity | 200 | 2000 | UNDERSIZED |
| innodb_io_capacity_max | 2000 | 4000 | UNDERSIZED |
| innodb_stats_persistent_sample_pages | 20 | 64 | UNDERSIZED |
| ... | ... | ... | ... |
+--------------------------------------+----------+-----------------------+-------------+

=== Per-table recommendations (25 tables) ===
+--------------------------+-----------+--------+-----------+--------------------------+--------+----------------------------------+
| Table | Rows | Size | Current | Recommended | Action | Reason |
+--------------------------+-----------+--------+-----------+--------------------------+--------+----------------------------------+
| storage_container_entity | 580,123 | 1.5 GB | (default) | STATS_AUTO_RECALC=1, | APPLY | Large entity table; bump InnoDB |
| | | | | STATS_PERSISTENT=1, | | stats sampling |
| | | | | STATS_SAMPLE_PAGES=64 | | |
| ... | ... | ... | ... | ... | ... | ... |
+--------------------------+-----------+--------+-----------+--------------------------+--------+----------------------------------+
```

The `--apply` SQL on MySQL uses `ALTER TABLE \`name\` STATS_PERSISTENT=1, STATS_AUTO_RECALC=1, STATS_SAMPLE_PAGES=64` (no parens, no equals-spaces — MySQL syntax).

Architecture

```
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/
├── Action.java # APPLY / TIGHTEN / RELAX / OK / SKIP
├── TableStats.java # input record (rows, bytes, current settings)
├── TableRecommendation.java # output record (action, current, recommended, reason)
├── ServerParamCheck.java # per-parameter compliance row
├── DbTuneResult.java # bundle: engine, version, params, recommendations
├── AutoTuner.java # interface: analyze / recommend / apply / analyzeOne
├── PostgresTuningCatalog.java # tier map: HOT_RELATIONSHIP, HOT_TAG_USAGE, ENTITY_LARGE, ENTITY_SERVICE, APPEND_ONLY
├── PostgresAutoTuner.java # autovacuum reloptions + ANALYZE
├── MysqlTuningCatalog.java # tier map: HOT, ENTITY_LARGE, ENTITY_SERVICE
├── MysqlAutoTuner.java # InnoDB STATS_* + ANALYZE TABLE
└── DbTuneReport.java # ASCII renderer + ALTER-statement preview
```

`recommend(stats)` is a pure function — input is observed table state, output is the action. Tested without a database. `apply()` and `analyzeOne()` are the I/O methods, tested in `DbTuneIT` against a live Testcontainer.

Heuristic

Tier Tables Recommended (Postgres) Recommended (MySQL) Row threshold
HOT entity_relationship, tag_usage scale=0.005/0.01, cost_limit=4000 (+cost_delay=0 for tag_usage) STATS_SAMPLE_PAGES=100 0
ENTITY_LARGE 22 entity tables scale=0.01/0.02 STATS_SAMPLE_PAGES=64 10,000
ENTITY_SERVICE database_entity, database_schema_entity scale=0.02/0.05 STATS_SAMPLE_PAGES=32 5,000
APPEND_ONLY change_event scale=0.1/0.2 (RELAX) (n/a — InnoDB has no relax knob) 50,000

Tests

  • 38 unit tests, all green: `PostgresAutoTunerTest` (20), `MysqlAutoTunerTest` (11), `DbTuneReportTest` (7).
  • `DbTuneIT` runs against the IT bootstrap's Testcontainer (Postgres by default; `-DdatabaseType=mysql` for MySQL): asserts `analyze()` returns recommendations, `apply()` writes reloptions and is idempotent, `analyzeOne()` runs without error, dry-run analyze does not mutate settings. Resets the modified table options in `finally`.

Test plan

  • CI: unit tests pass (`mvn test -Dtest='PostgresAutoTunerTest,MysqlAutoTunerTest,DbTuneReportTest'`)
  • CI: `DbTuneIT` passes against the Postgres profile
  • CI: `DbTuneIT` passes against the MySQL profile (`-DdatabaseType=mysql`)
  • Local: `./openmetadata-ops.sh db-tune` against a populated dev DB — verify report shape + sane recommendations
  • Local: `./openmetadata-ops.sh db-tune --apply --yes` followed by a second `db-tune` shows everything as OK (idempotency)
  • Local: `./openmetadata-ops.sh db-tune --apply --yes --analyze` completes without error and refreshes planner stats

🤖 Generated with Claude Code


Summary by Gitar

  • Added diagnostic ops:
    • Added --diagnose flag to db-tune for read-only DBA reports (unused indexes, bloat, slow queries).
    • Implemented PostgresDiagnostic and MysqlDiagnostic using sys, performance_schema, and pg_stat_* views.
  • Diagnostic reporting architecture:
    • Introduced DbTuneDiagnosis and DiagnosticCategory to support structured reporting and predictable table layouts.
    • Added DiagnosticReportTest for logic coverage of findings grouping and report rendering.
  • Enhanced IT coverage:
    • Added diagnoseCompletesWithoutError to DbTuneIT to verify end-to-end execution against live Testcontainers.

This will update automatically on new commits.

Generate a per-table autovacuum (Postgres) / InnoDB stats (MySQL)
tuning report from the live database; --apply executes the ALTER
statements after operator confirmation; --analyze refreshes planner
stats on the changed tables. Heuristic skips small dev tables
(row-count gated) and recognizes already-tightened settings as OK.

Catalog values come from the 600k-container production audit. The
companion runbook in docs/rds-postgres-production-runbook.md captures
the parameter-group settings, extensions, and the indexes that must be
applied by hand because the migration runner can't dedup cross-version.

Tests: 38 unit tests on the heuristic + report formatter; DbTuneIT
covers analyze / apply / analyzeOne / dry-run against the IT
bootstrap's live Postgres or MySQL Testcontainer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 18:55
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels May 4, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an operator-facing db-tune ops subcommand plus a production RDS Postgres runbook, backed by a new dbtune Java utility package that can analyze table stats, produce tuning recommendations, and (optionally) apply per-table settings and refresh planner stats for Postgres/MySQL.

Changes:

  • Introduces org.openmetadata.service.util.dbtune (catalogs, engine-specific tuners, report renderer, result models).
  • Wires a new db-tune subcommand into OpenMetadataOperations with --apply/--yes/--analyze.
  • Adds unit tests for heuristic/report formatting and an integration test (DbTuneIT), plus an RDS Postgres production runbook.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java Adds db-tune CLI command, prompts, and apply/analyze execution flow
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/Action.java Defines actionable vs non-actionable recommendation actions
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/AutoTuner.java Defines the engine-specific tuner contract (analyze/recommend/apply/analyzeOne/sql builder)
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/DbTuneReport.java Renders ASCII report output and ALTER statement previews
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/DbTuneResult.java Bundles engine/version + server param checks + per-table recommendations
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/MysqlAutoTuner.java Implements MySQL stats/options inspection, recommendations, and ALTER/ANALYZE generation
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/MysqlTuningCatalog.java Maps MySQL tables to stats sampling “profiles” and thresholds
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/PostgresAutoTuner.java Implements Postgres reloptions inspection, recommendations, and ALTER/ANALYZE generation
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/PostgresTuningCatalog.java Maps Postgres tables to autovacuum “profiles” and thresholds
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/ServerParamCheck.java Models server-level parameter compliance rows
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/TableRecommendation.java Models per-table recommendation outputs
openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/TableStats.java Models per-table observed stats/settings inputs
openmetadata-service/src/test/java/org/openmetadata/service/util/dbtune/DbTuneReportTest.java Unit tests for report formatting helpers and sections
openmetadata-service/src/test/java/org/openmetadata/service/util/dbtune/MysqlAutoTunerTest.java Unit tests for MySQL heuristic decisions, parsing, and SQL formatting
openmetadata-service/src/test/java/org/openmetadata/service/util/dbtune/PostgresAutoTunerTest.java Unit tests for Postgres heuristic decisions, parsing, quoting, and SQL formatting
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/DbTuneIT.java Integration test exercising analyze/apply/analyze-one against Testcontainers DB
docs/rds-postgres-production-runbook.md Consolidated ops runbook for RDS Postgres parameter groups, extensions, table overrides, and manual indexes

Comment on lines +33 to +41
/**
* End-to-end tests for {@link AutoTuner} against the live Testcontainers database. The bootstrap
* runs every migration up to the current version, so the tracked entity tables (e.g.
* {@code storage_container_entity}) exist; we exercise the analyze → apply → analyze-one path
* against the real schema and reset the modified reloptions / table options at the end.
*
* <p>Sequential because each test mutates table-level reloptions on shared production tables;
* parallel execution would race between read-stats and apply.
*/
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a719d47: added @execution(ExecutionMode.SAME_THREAD) on the class so the tests don't race on shared reloptions when the suite runs in parallel.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

🔴 Playwright Results — 2 failure(s), 11 flaky

✅ 3991 passed · ❌ 2 failed · 🟡 11 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 298 0 1 4
🔴 Shard 2 744 2 4 8
🟡 Shard 3 753 0 2 7
🟡 Shard 4 774 0 1 18
🟡 Shard 5 686 0 1 41
🟡 Shard 6 736 0 2 8

Genuine Failures (failed on all attempts)

Features/Container.spec.ts › Container page children pagination (shard 2)
�[31mTest timeout of 180000ms exceeded.�[39m
Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2)
Error: �[2mexpect(�[22m�[31mlocator�[39m�[2m).�[22mtoHaveText�[2m(�[22m�[32mexpected�[39m�[2m)�[22m failed

Locator:  locator('[data-row-key*="DraftTerm1778032067151"]').locator('.status-badge')
Expected: �[32m"Draft"�[39m
Received: �[31m"In Review"�[39m
Timeout:  15000ms

Call log:
�[2m  - Expect "toHaveText" with timeout 15000ms�[22m
�[2m  - waiting for locator('[data-row-key*="DraftTerm1778032067151"]').locator('.status-badge')�[22m
�[2m    18 × locator resolved to <div class="status-badge inReview" data-testid=""PW%'a551e0a7.Merrya9c086c9".DraftTerm1778032067151-status">…</div>�[22m
�[2m       - unexpected value "In Review"�[22m

🟡 11 flaky test(s) (passed on retry)
  • Pages/Bots.spec.ts › Bots Page should work properly (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when owner is added (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataProductPersonaCustomization.spec.ts › Data Product - customize tab label should only render if it's customized by user (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Set & Update all CP types on apiCollection (shard 4, 1 retry)
  • Pages/EntityDataSteward.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Teams.spec.ts › Teams Page Flow (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings May 5, 2026 17:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

…/match disambiguation, script path

- ServerParamCheck.STATUS_UNDERSIZED -> STATUS_MISMATCH. Some recommended
  values are deliberately lower than engine defaults (random_page_cost
  1.1 vs 4.0, autovacuum scale factors). Labelling those as "UNDERSIZED"
  was wrong. Added a regression test that pins random_page_cost as
  MISMATCH when the current value is higher than recommended.

- dbTune --apply: distinguish "no tracked tables on this DB" from "all
  tables already match" in the Nothing-to-apply log line. Mirrors the
  same disambiguation already in DbTuneReport.appendNextSteps.

- DbTuneReport next-steps now points at ./bootstrap/openmetadata-ops.sh
  (the actual location of the wrapper script) instead of the bare
  ./openmetadata-ops.sh which would be file-not-found from the repo root.

- DbTuneIT analyzeReturnsRecommendationsForKnownTables: assertion
  message now references TEST_TABLE so it stays correct after the
  switch from storage_container_entity to entity_relationship.

Tests: 40 unit tests passing (+1 for the higher-than-recommended MISMATCH case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshach
Copy link
Copy Markdown
Collaborator Author

harshach commented May 5, 2026

Round 2 review (4230257768) — addressed all four findings in f5a945a.

Finding Fix
--apply empty == "already matches" is wrong when 0 tracked tables exist Branched: logs "no tracked tables exist on this database" vs "already matches" so the message matches what DbTuneReport.appendNextSteps already does.
Next-step snippet uses ./openmetadata-ops.sh (wrong path from repo root) Changed to ./bootstrap/openmetadata-ops.sh to match the actual script location.
Stale assertion message in DbTuneIT after switching TEST_TABLE Now interpolates TEST_TABLE so the message stays accurate if the constant moves again.
STATUS_UNDERSIZED is misleading when recommended is lower than default (e.g. random_page_cost = 1.1 vs default 4.0) Renamed to STATUS_MISMATCH (direction-agnostic) with a Javadoc note explaining why. New regression test buildServerCheck_currentHigherThanRecommendedIsAlsoMismatch pins the random_page_cost(current=4.0, recommended=1.1) → MISMATCH case.

Unit tests: 40 passing (was 39, +1 for the new direction-agnostic mismatch case).

🤖 Generated with Claude Code

The earlier IT had applyChangesReloptionsAndIsIdempotent + analyzeOne
running ALTER TABLE / ANALYZE TABLE against entity_relationship — a
real catalog table that the rest of the suite writes to. On MySQL this
broke ~17 downstream ITs (GlossaryResourceIT, GlossaryTermRelationsIT,
DomainResourceIT, LineageBrokenReferenceIT, OpenLineageLineageResolutionIT)
with "Cannot create a JSON value from a string with CHARACTER SET 'binary'"
on inserts into entity_relationship.json.

Root cause: ALTER TABLE bumps MySQL's per-table metadata version, which
invalidates JDBC prepared-statement caches across the whole shared
Testcontainer. After re-prepare, the JSON column metadata sometimes
comes back as VARBINARY in Connector/J's cache, breaking subsequent
JSON inserts. Confirmed via CI log timing: DbTuneIT finished at
17:58:12, GlossaryTermRelationsIT failed starting 17:58:53.

The recommendations themselves are correct — STATS_PERSISTENT=1,
STATS_AUTO_RECALC=1, STATS_SAMPLE_PAGES=100 are safe and modern InnoDB
defaults except for SAMPLE_PAGES. The test just cannot afford the side
effect on a shared DB.

Fix: introduce dbtune_it_isolated_table created/dropped per test in
@BeforeEach/@AfterEach, build a TableRecommendation for it directly
(bypassing the static catalog), and run apply + analyzeOne against it.
The read-only tests (analyze, dry-run) continue to use the real catalog
since they don't mutate anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 23:19
Comment on lines +117 to +123
jdbi.useHandle(handle -> tuner.apply(handle, rec));

String built = tuner.buildAlterStatement(rec);
assertTrue(built.contains(ISOLATED_TABLE), "ALTER target table mismatch: " + built);

// Apply twice — second invocation must complete without throwing.
jdbi.useHandle(handle -> tuner.apply(handle, rec));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: applyExecutesAndIsIdempotent no longer verifies DB state

The old test used assertSettingsMatch to confirm settings were actually persisted in the database after apply(). The new test only verifies that buildAlterStatement returns a string containing the table name and that apply() doesn't throw. While the rationale for using an isolated table is solid (avoiding MySQL metadata invalidation), the test could still read back the settings from the isolated table to confirm they took effect — there's no risk of side-effects on shared tables since ISOLATED_TABLE is private to this test.

This would restore the end-to-end confidence that apply() actually mutates the DB, rather than just not crashing.

Suggested fix:

// After the first apply, verify the settings were actually written:
Map<String, String> after = currentSettingsFor(tuner, jdbi, ISOLATED_TABLE);
assertSettingsMatch(rec.recommendedSettings(), after);

// Apply twice — must be idempotent
jdbi.useHandle(handle -> tuner.apply(handle, rec));
Map<String, String> afterSecond = currentSettingsFor(tuner, jdbi, ISOLATED_TABLE);
assertEquals(after, afterSecond, "Apply should be idempotent");

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Comment on lines +116 to +125

jdbi.useHandle(handle -> tuner.apply(handle, rec));

String built = tuner.buildAlterStatement(rec);
assertTrue(built.contains(ISOLATED_TABLE), "ALTER target table mismatch: " + built);

// Apply twice — second invocation must complete without throwing.
jdbi.useHandle(handle -> tuner.apply(handle, rec));
}

* <li>Read observed table stats and current parameter-group settings from the database.
* <li>Compute a recommended table-level reloption set per table (pure logic — see
* {@link #recommend(TableStats)}).
* <li>Apply the recommendations via {@code ALTER TABLE ... SET (...)} when the operator opts in.
Comment on lines +25 to +27
* the planner picks the right index against multi-GB JSONB heaps. {@code STATS_PERSISTENT=1} +
* {@code STATS_AUTO_RECALC=1} are the modern InnoDB defaults; we assert them explicitly so a
* tenant with stale my.cnf overrides converges.
Comment on lines +2481 to +2487
@Command(
name = "db-tune",
description =
"Generate a per-table autovacuum / InnoDB stats tuning report and optionally apply it. "
+ "Default mode is read-only — pass --apply to execute the ALTER TABLE statements "
+ "and --analyze to refresh planner stats on changed tables.")
public Integer dbTune(
The tuning report we already had is a static playbook (table → recipe + row-count gate). Operators rightly asked whether db-tune actually inspects the live database for *its own* signals — unused indexes, bloat, slow queries, cache hit ratios — and surfaces recommendations from those measurements. Until now: no.

This adds a parallel read-only path:

  ./bootstrap/openmetadata-ops.sh db-tune --diagnose

Postgres categories (each isolated in its own try block; missing extension or permissions surfaces in DbTuneDiagnosis.notes rather than failing the run):

- UNUSED_INDEX — pg_stat_user_indexes idx_scan=0, size > 10 MB, non-unique non-pkey
- HIGH_DEAD_TUPLES — n_dead_tup/n_live_tup > 0.2 with n_live_tup > 10k (autovacuum falling behind)
- LOW_CACHE_HIT — heap_blks_read > 1000 AND hit ratio < 90%
- STALE_STATS — last_autoanalyze NULL or > 14 days, n_live_tup > 1000
- SEQ_SCAN_HEAVY — seq_scan/idx_scan > 10 with > 1000 seq scans (suggests missing index)
- SLOW_QUERY — pg_stat_statements top 10 by mean_exec_time, calls > 100 (gracefully skipped if extension absent)

MySQL categories:

- UNUSED_INDEX — sys.schema_unused_indexes filtered to current schema
- LOW_BUFFER_POOL_HIT — Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests < 99%
- SLOW_QUERY — performance_schema.events_statements_summary_by_digest top 10 by avg_timer_wait
- FULL_TABLE_SCAN — sys.statements_with_full_table_scans

Concept stays separate from AutoTuner: Diagnostic is read-only and never participates in --apply. Categories with zero findings are suppressed in the report; notes capture what couldn't be checked. Each finding is structured as (category, severity, attributes) with the attribute keys driven by DiagnosticCategory.columns() so the renderer dispatches a category-specific layout.

Tests: 50 unit tests passing (40 → 50, +10 in DiagnosticReportTest covering EnumMap grouping order, empty/non-empty rendering, suppression, notes appending, query truncation, and column-list immutability). DbTuneIT gains diagnoseCompletesWithoutErrorAndReturnsStructuredResult that exercises the full end-to-end against the live Testcontainer (read-only, safe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshach
Copy link
Copy Markdown
Collaborator Author

harshach commented May 5, 2026

Added `--diagnose` in 2d9f047. The tuning report stays static (curated playbook of known-hot tables); the new diagnostic path is the data-driven half — it reads the live database for unused indexes, bloat, slow queries, cache hit, etc., and surfaces what it finds.

Usage

```
./bootstrap/openmetadata-ops.sh db-tune # tuning report only (static)
./bootstrap/openmetadata-ops.sh db-tune --diagnose # tuning + read-only diagnostic
./bootstrap/openmetadata-ops.sh db-tune --apply --analyze --diagnose # all of the above
```

`--diagnose` is purely read-only. It never participates in `--apply` — diagnostic findings (e.g. "this index has 0 scans and is 1.2 GB") require human review before any DROP, so we deliberately don't auto-act on them.

Postgres categories

Category Threshold Source
UNUSED_INDEX `idx_scan = 0` AND size > 10 MB, non-unique non-pkey `pg_stat_user_indexes`
HIGH_DEAD_TUPLES `n_dead_tup/n_live_tup > 0.2` with `n_live_tup > 10k` `pg_stat_user_tables`
LOW_CACHE_HIT reads > 1000 AND hit ratio < 90% `pg_statio_user_tables`
STALE_STATS last_autoanalyze NULL or > 14 days `pg_stat_user_tables`
SEQ_SCAN_HEAVY `seq_scan/idx_scan > 10` with > 1000 seq scans `pg_stat_user_tables`
SLOW_QUERY top 10 by `mean_exec_time`, calls > 100 `pg_stat_statements` (note if absent)

MySQL categories

Category Threshold Source
UNUSED_INDEX scans=0 in current schema `sys.schema_unused_indexes`
LOW_BUFFER_POOL_HIT hit ratio < 99% `performance_schema.global_status`
SLOW_QUERY top 10 by `avg_timer_wait` in current schema `performance_schema.events_statements_summary_by_digest`
FULL_TABLE_SCAN top 10 by `exec_count` `sys.statements_with_full_table_scans`

Sample output (Postgres)

```
=== Diagnostic findings ===

Unused indexes (3 found):
Indexes with zero scans since last stats reset; candidates for DROP after a usage review.
+--------------------------+-----------------------------------+--------+--------+
| table | index | size | scans |
+--------------------------+-----------------------------------+--------+--------+
| tag_usage | idx_tag_usage_target_source | 3.2 GB | 0 |
| tag_usage | idx_tag_usage_target_exact | 3.3 GB | 0 |
| tag_usage | idx_tag_usage_join_source | 3.6 GB | 0 |
+--------------------------+-----------------------------------+--------+--------+

Tables with low cache hit ratio (1 found):
Heap reads exceed 1000 with hit ratio < 90%; suggests undersized buffers or hot seq scans.
+-----------+-------------+-------------+---------+
| table | heap_reads | heap_hits | hit_pct |
+-----------+-------------+-------------+---------+
| tag_usage | 292,973,695 | 950,528,269 | 76.44% |
+-----------+-------------+-------------+---------+

Top slowest queries (10 found):
From pg_stat_statements / events_statements_summary_by_digest. Truncated to 100 chars.
+----------------------------------------------------------------------------------------------------+--------+----------+
| query | calls | mean_ms |
+----------------------------------------------------------------------------------------------------+--------+----------+
| SELECT ... FROM container_entity ce LEFT JOIN entity_relationship er ON ... WHERE er.toId IS NULL | 4823 | 2845.7 |
| ... | ... | ... |
+----------------------------------------------------------------------------------------------------+--------+----------+

Notes:

  • (none — all checks succeeded)
    ```

The first three rows mirror exactly the slack-thread audit: 9 of 12 `tag_usage` indexes had 0 scans and consumed ~24 GB combined. `db-tune --diagnose` would have flagged that automatically.

Architecture

```
dbtune/
├── Diagnostic.java # interface: DbTuneDiagnosis diagnose(Handle)
├── DbTuneDiagnosis.java # record(findings, notes); EnumMap-grouped accessor
├── DiagnosticCategory.java # enum with title, description, column layout per category
├── Finding.java # record(category, severity, attributes Map)
├── Severity.java # INFO, WARN
├── PostgresDiagnostic.java # 6 categories, each in its own try block for graceful degrade
└── MysqlDiagnostic.java # 4 categories
```

`Diagnostic` is intentionally separate from `AutoTuner` — diagnostics are read-only by definition and shouldn't be conflated with the apply path. Categories with zero findings are suppressed in the report; `notes` lists what couldn't be checked (missing extension, view permissions, etc.).

Tests

  • 50 unit tests passing (40 → 50, +10 in `DiagnosticReportTest`): EnumMap grouping order, empty-vs-populated render, category section suppression, notes appending, query truncation, and column-list immutability.
  • DbTuneIT gains `diagnoseCompletesWithoutErrorAndReturnsStructuredResult` — exercises the full diagnose path end-to-end against the live Testcontainer. Read-only, no shared-table side effects.

🤖 Generated with Claude Code

Comment on lines +183 to +197
List<Finding> seqScanHeavy(final Handle handle) {
return handle
.createQuery(
"SELECT relname AS table_name, seq_scan, idx_scan "
+ "FROM pg_stat_user_tables "
+ "WHERE seq_scan > :min_seq "
+ " AND seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio "
+ "ORDER BY seq_scan DESC "
+ "LIMIT 25")
.bind("min_seq", SEQ_SCAN_MIN)
.bind("ratio", SEQ_SCAN_RATIO)
.map(
(rs, ctx) ->
new Finding(
DiagnosticCategory.SEQ_SCAN_HEAVY,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: seqScanHeavy excludes worst-case tables (idx_scan=0)

The SQL WHERE clause uses seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio. When idx_scan = 0, NULLIF returns NULL making the whole comparison NULL (falsy), so those rows are excluded from results. Ironically, tables with many sequential scans and zero index scans are the most concerning candidates for a missing index, yet they'll never appear in findings.

The Java mapper at line 204 has a dead-code branch handling idx_scan == 0 (returns "∞") that can never execute given the query filter.

Suggested fix:

Change the WHERE clause to handle idx_scan=0 explicitly:

"WHERE seq_scan > :min_seq "
+ "  AND (idx_scan = 0 OR seq_scan::numeric / idx_scan > :ratio) "

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 5, 2026

Code Review 👍 Approved with suggestions 1 resolved / 3 findings

Adds a db-tune ops subcommand for database performance diagnostics and tuning, resolving an NPE in the catch block. Verify the DB state persistence in applyExecutesAndIsIdempotent and consider including tables with zero index scans in the seqScanHeavy analysis.

💡 Quality: applyExecutesAndIsIdempotent no longer verifies DB state

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/DbTuneIT.java:117-123

The old test used assertSettingsMatch to confirm settings were actually persisted in the database after apply(). The new test only verifies that buildAlterStatement returns a string containing the table name and that apply() doesn't throw. While the rationale for using an isolated table is solid (avoiding MySQL metadata invalidation), the test could still read back the settings from the isolated table to confirm they took effect — there's no risk of side-effects on shared tables since ISOLATED_TABLE is private to this test.

This would restore the end-to-end confidence that apply() actually mutates the DB, rather than just not crashing.

Suggested fix
// After the first apply, verify the settings were actually written:
Map<String, String> after = currentSettingsFor(tuner, jdbi, ISOLATED_TABLE);
assertSettingsMatch(rec.recommendedSettings(), after);

// Apply twice — must be idempotent
jdbi.useHandle(handle -> tuner.apply(handle, rec));
Map<String, String> afterSecond = currentSettingsFor(tuner, jdbi, ISOLATED_TABLE);
assertEquals(after, afterSecond, "Apply should be idempotent");
💡 Edge Case: seqScanHeavy excludes worst-case tables (idx_scan=0)

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/PostgresDiagnostic.java:183-197

The SQL WHERE clause uses seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio. When idx_scan = 0, NULLIF returns NULL making the whole comparison NULL (falsy), so those rows are excluded from results. Ironically, tables with many sequential scans and zero index scans are the most concerning candidates for a missing index, yet they'll never appear in findings.

The Java mapper at line 204 has a dead-code branch handling idx_scan == 0 (returns "∞") that can never execute given the query filter.

Suggested fix
Change the WHERE clause to handle idx_scan=0 explicitly:

"WHERE seq_scan > :min_seq "
+ "  AND (idx_scan = 0 OR seq_scan::numeric / idx_scan > :ratio) "
✅ 1 resolved
Edge Case: NPE in catch block when exception message is null

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java:2570
List.of() does not accept null elements. If the caught exception has a null message (e.g., a raw NullPointerException or IllegalStateException without a message string), e.getMessage() returns null and List.of(...) throws a secondary NullPointerException, swallowing the original error and producing an unhelpful stack trace.

🤖 Prompt for agents
Code Review: Adds a db-tune ops subcommand for database performance diagnostics and tuning, resolving an NPE in the catch block. Verify the DB state persistence in `applyExecutesAndIsIdempotent` and consider including tables with zero index scans in the `seqScanHeavy` analysis.

1. 💡 Quality: applyExecutesAndIsIdempotent no longer verifies DB state
   Files: openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/DbTuneIT.java:117-123

   The old test used `assertSettingsMatch` to confirm settings were actually persisted in the database after `apply()`. The new test only verifies that `buildAlterStatement` returns a string containing the table name and that `apply()` doesn't throw. While the rationale for using an isolated table is solid (avoiding MySQL metadata invalidation), the test could still read back the settings from the isolated table to confirm they took effect — there's no risk of side-effects on shared tables since `ISOLATED_TABLE` is private to this test.
   
   This would restore the end-to-end confidence that `apply()` actually mutates the DB, rather than just not crashing.

   Suggested fix:
   // After the first apply, verify the settings were actually written:
   Map<String, String> after = currentSettingsFor(tuner, jdbi, ISOLATED_TABLE);
   assertSettingsMatch(rec.recommendedSettings(), after);
   
   // Apply twice — must be idempotent
   jdbi.useHandle(handle -> tuner.apply(handle, rec));
   Map<String, String> afterSecond = currentSettingsFor(tuner, jdbi, ISOLATED_TABLE);
   assertEquals(after, afterSecond, "Apply should be idempotent");

2. 💡 Edge Case: seqScanHeavy excludes worst-case tables (idx_scan=0)
   Files: openmetadata-service/src/main/java/org/openmetadata/service/util/dbtune/PostgresDiagnostic.java:183-197

   The SQL WHERE clause uses `seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio`. When `idx_scan = 0`, `NULLIF` returns NULL making the whole comparison NULL (falsy), so those rows are excluded from results. Ironically, tables with many sequential scans and *zero* index scans are the most concerning candidates for a missing index, yet they'll never appear in findings.
   
   The Java mapper at line 204 has a dead-code branch handling `idx_scan == 0` (returns "∞") that can never execute given the query filter.

   Suggested fix:
   Change the WHERE clause to handle idx_scan=0 explicitly:
   
   "WHERE seq_scan > :min_seq "
   + "  AND (idx_scan = 0 OR seq_scan::numeric / idx_scan > :ratio) "

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 6, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants