fix: prevent job cancellation on transient database errors during entitlement checks by bhellema · Pull Request #2319 · adobe/spacecat-audit-worker

bhellema · 2026-04-06T17:40:59Z

Preflight jobs were incorrectly cancelled when database connection pool
timeouts (PGRST003) occurred during entitlement validation. The error
handling treated all TierClient failures as "site not entitled" and
returned false, causing AsyncJobRunner to set job status to CANCELLED
instead of allowing Lambda/SQS to retry.

Root cause: checkProductCodeEntitlements() caught all errors and
returned false, including transient infrastructure failures like:

- PGRST000-003: PostgREST connection/timeout errors (503/504)
- Network errors: ECONNREFUSED, ETIMEDOUT, ENOTFOUND
- HTTP server errors: 429, 500, 502, 503, 504

Solution: Created isTransientTierClientError() classifier to distinguish
transient errors (should retry) from permanent errors (not entitled).
Transient errors now throw to trigger Lambda/SQS retry mechanism instead
of incorrectly cancelling jobs.

Changes:

- Added tier-client-error-classifier.js with error classification logic
- Updated checkProductCodeEntitlements() to rethrow transient errors
- Updated checkSiteRequiresValidation() to rethrow transient errors
- Updated isPaidLLMOCustomer() to rethrow transient errors
- Added comprehensive test coverage for all error scenarios
- Documented scope analysis confirming fix addresses all vulnerable areas

codecov · 2026-04-06T17:42:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions · 2026-04-06T17:44:01Z

This PR will trigger a patch release when merged.

ekremney

Review

Critical: Classifier won't match real PGRST003 errors

I traced the error propagation end-to-end through the data-access layer and TierClient.

PostgREST returns 504 → @supabase/postgrest-js returns { error: { code: 'PGRST003', message: 'Timed out...' }, data: null } → base.collection.js#logAndThrowError('Failed to query', pgrstError) → throws DataAccessError

The DataAccessError that reaches the catch blocks looks like:

{
  name: 'DataAccessError',
  message: 'Failed to query',              // ← generic wrapper
  details: { entityName: 'Entitlement', tableName: 'entitlements' },
  cause: {                                  // ← the real info is HERE
    code: 'PGRST003',
    message: 'Timed out acquiring connection from connection pool',
    details: '...',
    hint: '...'
  }
}

TierClient re-throws this unchanged (tier-client.js:127).

The classifier checks:

- error.code → undefined (DataAccessError has no .code)
- error.statusCode / error.status → undefined
- error.message → 'Failed to query', which does NOT contain 'connection pool', 'timed out', etc.

None of the transient checks match. The classifier returns false, and the bug remains unfixed. The PGRST code and meaningful message live at error.cause.code and error.cause.message, but the classifier never looks there.

The data-access layer itself already has the correct pattern — #isInvalidInputError in base.collection.js walks the .cause chain:

#isInvalidInputError(error) {
  let current = error;
  while (current) {
    if (current?.code === '22P02') return true;
    current = current.cause;
  }
  return false;
}

The classifier needs to do the same.
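
As a rough illustration, a cause-walking classifier could look like the sketch below. The function name and the transient codes are taken from the PR description; the constant names and the depth cap are assumptions added here, and this is not the PR's actual implementation.

```javascript
// Codes and statuses listed as transient in the PR description.
const TRANSIENT_PGRST_CODES = new Set(['PGRST000', 'PGRST001', 'PGRST002', 'PGRST003']);
const TRANSIENT_NETWORK_CODES = new Set(['ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND']);
const TRANSIENT_HTTP_STATUSES = new Set([429, 500, 502, 503, 504]);

// Assumption: cap the walk so a cyclic .cause chain cannot loop forever.
const MAX_CAUSE_DEPTH = 10;

function isTransientTierClientError(error) {
  let current = error;
  for (let depth = 0; current && depth < MAX_CAUSE_DEPTH; depth += 1) {
    // Check the wrapper first, then each nested cause in turn.
    if (TRANSIENT_PGRST_CODES.has(current.code)) return true;
    if (TRANSIENT_NETWORK_CODES.has(current.code)) return true;
    const status = current.statusCode ?? current.status;
    if (TRANSIENT_HTTP_STATUSES.has(status)) return true;
    current = current.cause;
  }
  return false;
}

// A wrapper with the generic message and the PGRST code only on .cause.
const wrapped = new Error('Failed to query', {
  cause: { code: 'PGRST003', message: 'Timed out acquiring connection from connection pool' },
});
const plain = new Error('Failed to query');
```

With this shape, `isTransientTierClientError(wrapped)` is true because the code is found one level down, while `isTransientTierClientError(plain)` is false because nothing in the chain carries a transient code or status.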

Critical: Tests use wrong error shapes

All tests construct errors like:

const dbError = new Error('Timed out acquiring connection from connection pool');
dbError.code = 'PGRST003';

But real errors are DataAccessError where .code is undefined and .message is 'Failed to query'. The tests pass because they use fabricated error shapes that don't match production, masking the classifier bug above.

Medium: `checkSiteRequiresValidation` throw is silently caught

In src/index.js:296-310:

try {
  const site = await Site.findById(siteId);
  if (site) {
    const requiresValidation = await checkSiteRequiresValidation(site, context, type);
    site.requiresValidation = requiresValidation;
    context.site = site;
  }
} catch (e) {
  if (!siteId.startsWith('warmup-site-')) {
    log.warn(`Failed to fetch site ${siteId}: ${e.message}`);
  }
}

If checkSiteRequiresValidation throws (per the new change), the catch here swallows it. No Lambda retry happens. The audit handler still runs, but now context.site is never assigned (the throw happens before line 303-304). This is a behavioral change — previously the site was always set on context; now on transient errors it won't be. Downstream impact depends on what handlers do when context.site is undefined.
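
One possible way to let retryable failures escape this catch is sketched below. The function name `loadSiteIntoContext` and the injected `deps` object are hypothetical, introduced only to keep the example standalone; whether rethrowing here is the desired behavior is a decision for the PR author.

```javascript
// Hypothetical restructuring of the block above with dependencies injected.
async function loadSiteIntoContext(siteId, context, deps) {
  const { findSiteById, checkSiteRequiresValidation, isTransient, log } = deps;
  try {
    const site = await findSiteById(siteId);
    if (site) {
      site.requiresValidation = await checkSiteRequiresValidation(site, context);
      context.site = site;
    }
  } catch (e) {
    // Let retryable failures escape so the invocation fails and can be retried.
    if (isTransient(e)) throw e;
    // Everything else keeps the existing behavior: log and continue.
    if (!siteId.startsWith('warmup-site-')) {
      log.warn(`Failed to fetch site ${siteId}: ${e.message}`);
    }
  }
  return context;
}
```

With this shape a transient failure rejects the returned promise instead of being logged and dropped, while non-transient failures still leave `context.site` unset as they do today.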

Medium: Outer catch in `checkProductCodeEntitlements` semantics changed

} catch (error) {
  // Old: return false (cancel job)
  // New: throw error (retry)
  context.log.error('Transient error in entitlement check, job will retry:', error);
  throw error;
}

The comment says "a transient error was thrown from inner loop" but any error landing here is now rethrown — including non-transient unexpected errors (e.g., from Promise.all itself). The old code returned false for all of these (cancel). The new code retries all of them. This is probably safer in practice, but the comment is misleading.
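
If the intent is to retry only transient failures and keep the old "not entitled" result for everything else, the outer catch could be shaped roughly like this. `checkEntitlementsSafely`, `runChecks`, and `isTransient` are hypothetical names used only for this sketch.

```javascript
// Hypothetical wrapper: `runChecks` performs the entitlement lookups and
// `isTransient` is the classifier; both are injected to keep the sketch standalone.
async function checkEntitlementsSafely(runChecks, isTransient, log) {
  try {
    return await runChecks();
  } catch (error) {
    if (isTransient(error)) {
      log.error('Transient error in entitlement check, job will retry:', error);
      throw error;
    }
    // Old behavior preserved for unexpected, non-transient failures.
    log.error('Unexpected error in entitlement check, treating site as not entitled:', error);
    return false;
  }
}
```

This trades the "retry everything" behavior for an explicit split, so the log message and the comment describe exactly which branch ran.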

Low: HTTP 500 classified as transient

500 from PostgREST could indicate a bug (bad query, schema mismatch), not just a transient outage. Retrying until DLQ is low risk for simple TierClient queries but worth noting.

Infra context

The PostgREST pool is configured as:

- PGRST_DB_POOL=100 per ECS task
- PGRST_DB_POOL_ACQUISITION_TIMEOUT=10 seconds
- 10 prod ECS tasks = 1000 total PostgREST connections
- Auto-scaling is disabled because it "churns connection pools and causes sporadic fetch failed errors"

The 10-second acquisition timeout with auto-scaling disabled means PGRST003 is a real and recurring scenario under load, making it important to get this fix right.

Summary

| Severity | Finding |
| --- | --- |
| Critical | Classifier checks `error.code`/`.message` but real errors are `DataAccessError` with info nested in `.cause`, so the fix doesn't fix the bug |
| Critical | Tests use fabricated error shapes that don't match production, masking the classifier bug |
| Medium | `checkSiteRequiresValidation` throws are caught by caller in `index.js`: no retry, and `context.site` not set |
| Medium | Outer catch now rethrows ALL errors, not just transient ones |
| Low | HTTP 500 as transient |

The classifier needs to walk the error.cause chain to find PGRST codes and messages, and tests should use DataAccessError with the real nested error shape.

fix: prevent job cancellation on transient database errors during entitlement checks

63e0c1b

Merge branch 'main' into pool

d5d7206

bhellema marked this pull request as ready for review April 6, 2026 17:50

bhellema requested a review from ekremney April 6, 2026 18:39

ekremney reviewed Apr 7, 2026

bhellema wants to merge 3 commits into main from pool
