fix: watch Postgres CRs to resolve active-engine race condition #679

Merged
Starefossen merged 3 commits into master from fix/watch-postgres-for-active-engine
May 26, 2026

Conversation

@Starefossen
Member

Problem

When a new Postgres CR is created, both pgrator and naiserator reconcile simultaneously. If naiserator reads the Postgres CR before pgrator has set the active-engine annotation, naiserator enters a 30-minute retry loop (FailedPrepare):

Error: Status: failure: Application/ssbno-statreg (FailedPrepare): preparing rollout configuration:
waiting for pgrator to set active-engine annotation on ssbno/statreg-api-test-db; will retry

Solution

Add a metadata-only Watch on data.nais.io/v1 Postgres CRs in both the Application and Naisjob controllers. When pgrator persists the active-engine annotation, the watch triggers a re-reconcile of any app/job in the same namespace whose spec.postgres.clusterName matches.
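The mapping step described above can be sketched in plain Go. This is only an illustration under stated assumptions: `PostgresMeta`, `Workload`, `Request`, and `mapPostgresToWorkloads` are hypothetical stand-ins for the controller-runtime types and the actual map function, and the sketch assumes that `spec.postgres.clusterName` is compared against the name of the Postgres CR.

```go
package main

import "fmt"

// PostgresMeta stands in for the metadata-only view of a Postgres CR.
// Hypothetical type: the real watch would deliver object metadata.
type PostgresMeta struct {
	Namespace string
	Name      string
}

// Workload stands in for an Application or Naisjob (hypothetical type).
// ClusterName mirrors spec.postgres.clusterName; empty means no Postgres.
type Workload struct {
	Namespace   string
	Name        string
	ClusterName string
}

// Request identifies a workload that should be re-reconciled.
type Request struct {
	Namespace string
	Name      string
}

// mapPostgresToWorkloads returns one request per workload that lives in the
// same namespace as pg and whose ClusterName equals pg.Name (an assumption
// about how the match is made).
func mapPostgresToWorkloads(pg PostgresMeta, all []Workload) []Request {
	var reqs []Request
	for _, w := range all {
		if w.Namespace != pg.Namespace {
			continue // only workloads in the Postgres CR's namespace
		}
		if w.ClusterName == "" || w.ClusterName != pg.Name {
			continue // workload does not reference this cluster
		}
		reqs = append(reqs, Request{Namespace: w.Namespace, Name: w.Name})
	}
	return reqs
}

func main() {
	// Made-up workloads for illustration only.
	workloads := []Workload{
		{Namespace: "team-a", Name: "orders", ClusterName: "orders-db"},
		{Namespace: "team-a", Name: "billing", ClusterName: "billing-db"},
		{Namespace: "team-b", Name: "orders", ClusterName: "orders-db"},
	}
	changed := PostgresMeta{Namespace: "team-a", Name: "orders-db"}
	for _, r := range mapPostgresToWorkloads(changed, workloads) {
		fmt.Printf("enqueue %s/%s\n", r.Namespace, r.Name)
	}
}
```

In the real controller the list would come from a namespace-scoped client call rather than an in-memory slice; the filtering logic is the part this sketch is meant to show.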

Safety

  • Hash-based skip: The synchronizer's hash check ensures that re-reconciling an app that has already synced successfully is a no-op
  • After FailedPrepare: The hash is NOT saved, so the re-triggered reconcile proceeds past the hash check and succeeds
  • Metadata-only watch: Uses WatchesMetadata, so no full Postgres spec is fetched, only metadata such as annotations and labels
  • Namespace-scoped list: The map function only lists apps in the same namespace as the changed Postgres CR
  • Existing RBAC: The helm chart already grants get;list;watch on data.nais.io/postgres
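The interaction between the hash-based skip and FailedPrepare can be illustrated with a toy model. This is a sketch, not the synchronizer's actual code: the `Synchronizer` type, its fields, `hashSpec`, and the idea that the hash covers only a spec string are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// errFailedPrepare models the FailedPrepare outcome (hypothetical error value).
var errFailedPrepare = errors.New("waiting for active-engine annotation; will retry")

// Synchronizer is a toy stand-in for the real synchronizer (hypothetical).
type Synchronizer struct {
	savedHash map[string]string // workload name -> hash of last successful sync
	rollouts  int               // number of rollouts actually performed
}

// hashSpec hashes the spec; the real hash input is an assumption here.
func hashSpec(spec string) string {
	sum := sha256.Sum256([]byte(spec))
	return hex.EncodeToString(sum[:])
}

// Reconcile reports whether the reconcile was skipped by the hash check.
// annotationSet models whether the active-engine annotation is present.
func (s *Synchronizer) Reconcile(name, spec string, annotationSet bool) (bool, error) {
	h := hashSpec(spec)
	if s.savedHash[name] == h {
		return true, nil // already synced with this spec: no-op
	}
	if !annotationSet {
		// Prepare fails and the hash is deliberately not saved,
		// so a later reconcile is not skipped.
		return false, errFailedPrepare
	}
	s.rollouts++
	s.savedHash[name] = h // saved only after a successful sync
	return false, nil
}

func main() {
	s := &Synchronizer{savedHash: map[string]string{}}
	// First attempt races ahead of the annotation; the next two arrive after it.
	for _, annotationSet := range []bool{false, true, true} {
		skipped, err := s.Reconcile("orders", "spec-v1", annotationSet)
		fmt.Printf("annotationSet=%v skipped=%v err=%v\n", annotationSet, skipped, err)
	}
}
```

Because the hash is written only on success, a watch-triggered reconcile after a failure does real work, while one that arrives after a successful sync returns early.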

Companion PR

Closes #674

When a new Postgres CR is created, both pgrator and naiserator reconcile
simultaneously. If naiserator reads the Postgres CR before pgrator has
set the active-engine annotation, naiserator enters a 30-minute retry
loop (FailedPrepare).

This adds a metadata-only watch on Postgres CRs that maps annotation
changes back to the owning Application/Naisjob, triggering an immediate
re-reconcile. The hash-based skip in the synchronizer ensures this is
a no-op for apps that are already successfully synced.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Starefossen Starefossen requested a review from a team as a code owner May 22, 2026 08:00
Only react to Postgres CR updates where annotations actually changed.
This avoids triggering namespace-wide app lists on status-only updates,
which matters as the number of Postgres CRs grows across namespaces.

Create/Delete events still pass through (correct behavior).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
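The annotation-change filter from the commit above might look roughly like the following. It is a minimal sketch using plain structs instead of controller-runtime's predicate and event types; `Event`, `EventKind`, `annotationsEqual`, and `shouldTrigger` are hypothetical names, and `active-engine` is used as a placeholder annotation key since the full key is not given here.

```go
package main

import "fmt"

// EventKind distinguishes the watch events the filter sees (hypothetical type).
type EventKind int

const (
	Create EventKind = iota
	Update
	Delete
)

// Event carries the old and new annotations of a Postgres CR.
// OldAnnotations is only meaningful for Update events.
type Event struct {
	Kind           EventKind
	OldAnnotations map[string]string
	NewAnnotations map[string]string
}

// annotationsEqual compares two annotation maps, treating nil and empty as equal.
func annotationsEqual(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if bv, ok := b[k]; !ok || bv != v {
			return false
		}
	}
	return true
}

// shouldTrigger lets Create and Delete through unconditionally and lets
// Update through only when the annotations differ.
func shouldTrigger(e Event) bool {
	if e.Kind != Update {
		return true
	}
	return !annotationsEqual(e.OldAnnotations, e.NewAnnotations)
}

func main() {
	statusOnly := Event{
		Kind:           Update,
		OldAnnotations: map[string]string{"active-engine": "a"},
		NewAnnotations: map[string]string{"active-engine": "a"},
	}
	annotated := Event{
		Kind:           Update,
		OldAnnotations: nil,
		NewAnnotations: map[string]string{"active-engine": "a"},
	}
	fmt.Println("status-only update triggers:", shouldTrigger(statusOnly))
	fmt.Println("annotation update triggers:", shouldTrigger(annotated))
}
```

Filtering before the map function runs is what avoids the namespace-wide list on status-only updates.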
Comment thread pkg/controllers/postgres_watch.go Outdated
…data

No need for a constructor function — it's only used at startup and
controller-runtime doesn't mutate it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Starefossen Starefossen merged commit 0e1dbe3 into master May 26, 2026
2 checks passed
@Starefossen Starefossen deleted the fix/watch-postgres-for-active-engine branch May 26, 2026 07:01

Development

Successfully merging this pull request may close these issues.

preparePostgres should retry when active-engine annotation is missing

2 participants