Skip to content

fix: prevent job cancellation on transient database errors during entitlement checks#2319

Open
bhellema wants to merge 3 commits intomainfrom
pool
Open

fix: prevent job cancellation on transient database errors during entitlement checks#2319
bhellema wants to merge 3 commits intomainfrom
pool

Conversation

@bhellema
Copy link
Copy Markdown
Contributor

@bhellema bhellema commented Apr 6, 2026

Preflight jobs were incorrectly cancelled when database connection pool
timeouts (PGRST003) occurred during entitlement validation. The error
handling treated all TierClient failures as "site not entitled" and
returned false, causing AsyncJobRunner to set job status to CANCELLED
instead of allowing Lambda/SQS to retry.

Root cause: checkProductCodeEntitlements() caught all errors and
returned false, including transient infrastructure failures like:

  • PGRST000-003: PostgREST connection/timeout errors (503/504)
  • Network errors: ECONNREFUSED, ETIMEDOUT, ENOTFOUND
  • HTTP server errors: 429, 500, 502, 503, 504

Solution: Created isTransientTierClientError() classifier to distinguish
transient errors (should retry) from permanent errors (not entitled).
Transient errors now throw to trigger Lambda/SQS retry mechanism instead
of incorrectly cancelling jobs.

Changes:

  • Added tier-client-error-classifier.js with error classification logic
  • Updated checkProductCodeEntitlements() to rethrow transient errors
  • Updated checkSiteRequiresValidation() to rethrow transient errors
  • Updated isPaidLLMOCustomer() to rethrow transient errors
  • Added comprehensive test coverage for all error scenarios
  • Documented scope analysis confirming fix addresses all vulnerable areas

…itlement checks

Preflight jobs were incorrectly cancelled when database connection pool
  timeouts (PGRST003) occurred during entitlement validation. The error
  handling treated all TierClient failures as "site not entitled" and
  returned false, causing AsyncJobRunner to set job status to CANCELLED
  instead of allowing Lambda/SQS to retry.

  Root cause: checkProductCodeEntitlements() caught all errors and
  returned false, including transient infrastructure failures like:
  - PGRST000-003: PostgREST connection/timeout errors (503/504)
  - Network errors: ECONNREFUSED, ETIMEDOUT, ENOTFOUND
  - HTTP server errors: 429, 500, 502, 503, 504

  Solution: Created isTransientTierClientError() classifier to distinguish
  transient errors (should retry) from permanent errors (not entitled).
  Transient errors now throw to trigger Lambda/SQS retry mechanism instead
  of incorrectly cancelling jobs.

  Changes:
  - Added tier-client-error-classifier.js with error classification logic
  - Updated checkProductCodeEntitlements() to rethrow transient errors
  - Updated checkSiteRequiresValidation() to rethrow transient errors
  - Updated isPaidLLMOCustomer() to rethrow transient errors
  - Added comprehensive test coverage for all error scenarios
  - Documented scope analysis confirming fix addresses all vulnerable areas
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

This PR will trigger a patch release when merged.

@bhellema bhellema marked this pull request as ready for review April 6, 2026 17:50
@bhellema bhellema requested a review from ekremney April 6, 2026 18:39
Copy link
Copy Markdown
Member

@ekremney ekremney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Critical: Classifier won't match real PGRST003 errors

I traced the error propagation end-to-end through the data-access layer and TierClient.

PostgREST returns 504@supabase/postgrest-js returns { error: { code: 'PGRST003', message: 'Timed out...' }, data: null }base.collection.js#logAndThrowError('Failed to query', pgrstError) → throws DataAccessError

The DataAccessError that reaches the catch blocks looks like:

{
  name: 'DataAccessError',
  message: 'Failed to query',              // ← generic wrapper
  details: { entityName: 'Entitlement', tableName: 'entitlements' },
  cause: {                                  // ← the real info is HERE
    code: 'PGRST003',
    message: 'Timed out acquiring connection from connection pool',
    details: '...',
    hint: '...'
  }
}

TierClient re-throws this unchanged (tier-client.js:127).

The classifier checks:

  • error.codeundefined (DataAccessError has no .code)
  • error.statusCode / error.statusundefined
  • error.message'Failed to query' — does NOT contain 'connection pool', 'timed out', etc.

None of the transient checks match. The classifier returns false, and the bug remains unfixed. The PGRST code and meaningful message live at error.cause.code and error.cause.message, but the classifier never looks there.

The data-access layer itself already has the correct pattern — #isInvalidInputError in base.collection.js walks the .cause chain:

#isInvalidInputError(error) {
  let current = error;
  while (current) {
    if (current?.code === '22P02') return true;
    current = current.cause;
  }
  return false;
}

The classifier needs to do the same.


Critical: Tests use wrong error shapes

All tests construct errors like:

const dbError = new Error('Timed out acquiring connection from connection pool');
dbError.code = 'PGRST003';

But real errors are DataAccessError where .code is undefined and .message is 'Failed to query'. The tests pass because they use fabricated error shapes that don't match production, masking the classifier bug above.


Medium: checkSiteRequiresValidation throw is silently caught

In src/index.js:296-310:

try {
  const site = await Site.findById(siteId);
  if (site) {
    const requiresValidation = await checkSiteRequiresValidation(site, context, type);
    site.requiresValidation = requiresValidation;
    context.site = site;
  }
} catch (e) {
  if (!siteId.startsWith('warmup-site-')) {
    log.warn(`Failed to fetch site ${siteId}: ${e.message}`);
  }
}

If checkSiteRequiresValidation throws (per the new change), the catch here swallows it. No Lambda retry happens. The audit handler still runs, but now context.site is never assigned (the throw happens before line 303-304). This is a behavioral change — previously the site was always set on context; now on transient errors it won't be. Downstream impact depends on what handlers do when context.site is undefined.


Medium: Outer catch in checkProductCodeEntitlements semantics changed

} catch (error) {
  // Old: return false (cancel job)
  // New: throw error (retry)
  context.log.error('Transient error in entitlement check, job will retry:', error);
  throw error;
}

The comment says "a transient error was thrown from inner loop" but any error landing here is now rethrown — including non-transient unexpected errors (e.g., from Promise.all itself). The old code returned false for all of these (cancel). The new code retries all of them. This is probably safer in practice, but the comment is misleading.


Low: HTTP 500 classified as transient

500 from PostgREST could indicate a bug (bad query, schema mismatch), not just a transient outage. Retrying until DLQ is low risk for simple TierClient queries but worth noting.


Infra context

The PostgREST pool is configured as:

  • PGRST_DB_POOL=100 per ECS task
  • PGRST_DB_POOL_ACQUISITION_TIMEOUT=10 seconds
  • 10 prod ECS tasks = 1000 total PostgREST connections
  • Auto-scaling is disabled because it "churns connection pools and causes sporadic fetch failed errors"

The 10-second acquisition timeout with auto-scaling disabled means PGRST003 is a real and recurring scenario under load, making it important to get this fix right.


Summary

Severity Finding
Critical Classifier checks error.code/.message but real errors are DataAccessError with info nested in .cause — the fix doesn't fix the bug
Critical Tests use fabricated error shapes that don't match production, masking the classifier bug
Medium checkSiteRequiresValidation throws are caught by caller in index.js — no retry, and context.site not set
Medium Outer catch now rethrows ALL errors, not just transient ones
Low HTTP 500 as transient

The classifier needs to walk the error.cause chain to find PGRST codes and messages, and tests should use DataAccessError with the real nested error shape.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants