…itlement checks

Preflight jobs were incorrectly cancelled when database connection pool timeouts (PGRST003) occurred during entitlement validation. The error handling treated all TierClient failures as "site not entitled" and returned false, causing AsyncJobRunner to set job status to CANCELLED instead of allowing Lambda/SQS to retry.

Root cause: checkProductCodeEntitlements() caught all errors and returned false, including transient infrastructure failures like:

- PGRST000-003: PostgREST connection/timeout errors (503/504)
- Network errors: ECONNREFUSED, ETIMEDOUT, ENOTFOUND
- HTTP server errors: 429, 500, 502, 503, 504

Solution: Created isTransientTierClientError() classifier to distinguish transient errors (should retry) from permanent errors (not entitled). Transient errors now throw to trigger the Lambda/SQS retry mechanism instead of incorrectly cancelling jobs.

Changes:

- Added tier-client-error-classifier.js with error classification logic
- Updated checkProductCodeEntitlements() to rethrow transient errors
- Updated checkSiteRequiresValidation() to rethrow transient errors
- Updated isPaidLLMOCustomer() to rethrow transient errors
- Added comprehensive test coverage for all error scenarios
- Documented scope analysis confirming the fix addresses all vulnerable areas
Codecov Report: ✅ All modified and coverable lines are covered by tests.

This PR will trigger a patch release when merged.
Review
Critical: Classifier won't match real PGRST003 errors
I traced the error propagation end-to-end through the data-access layer and TierClient.
PostgREST returns 504 → @supabase/postgrest-js returns { error: { code: 'PGRST003', message: 'Timed out...' }, data: null } → base.collection.js#logAndThrowError('Failed to query', pgrstError) → throws DataAccessError
The DataAccessError that reaches the catch blocks looks like:
```js
{
  name: 'DataAccessError',
  message: 'Failed to query',   // ← generic wrapper
  details: { entityName: 'Entitlement', tableName: 'entitlements' },
  cause: {                      // ← the real info is HERE
    code: 'PGRST003',
    message: 'Timed out acquiring connection from connection pool',
    details: '...',
    hint: '...'
  }
}
```

TierClient re-throws this unchanged (tier-client.js:127).
The classifier checks:
- error.code → undefined (DataAccessError has no .code)
- error.statusCode / error.status → undefined
- error.message → 'Failed to query' — does NOT contain 'connection pool', 'timed out', etc.
None of the transient checks match. The classifier returns false, and the bug remains unfixed. The PGRST code and meaningful message live at error.cause.code and error.cause.message, but the classifier never looks there.
The data-access layer itself already has the correct pattern — #isInvalidInputError in base.collection.js walks the .cause chain:
```js
#isInvalidInputError(error) {
  let current = error;
  while (current) {
    if (current?.code === '22P02') return true;
    current = current.cause;
  }
  return false;
}
```

The classifier needs to do the same.
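Sketched under the assumption that the transient code/status lists match the PR description (PGRST000-003, the listed network errors, and the listed HTTP statuses), a cause-walking version of the classifier might look like:

```javascript
// A minimal sketch, not the actual tier-client-error-classifier.js:
// look for transient PGRST codes, network error codes, or retryable
// HTTP statuses at ANY depth of the .cause chain, not just on the
// top-level (DataAccessError) wrapper.
const TRANSIENT_PGRST_CODES = new Set(['PGRST000', 'PGRST001', 'PGRST002', 'PGRST003']);
const TRANSIENT_NETWORK_CODES = new Set(['ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND']);
const TRANSIENT_HTTP_STATUSES = new Set([429, 500, 502, 503, 504]);

function isTransientTierClientError(error) {
  let current = error;
  while (current) {
    const status = current.statusCode ?? current.status;
    if (TRANSIENT_PGRST_CODES.has(current.code)
        || TRANSIENT_NETWORK_CODES.has(current.code)
        || TRANSIENT_HTTP_STATUSES.has(status)) {
      return true;
    }
    current = current.cause; // walk the chain like #isInvalidInputError does
  }
  return false;
}
```

With this shape, the wrapped DataAccessError from the trace above classifies as transient even though its own .code and .message are uninformative.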
Critical: Tests use wrong error shapes
All tests construct errors like:
```js
const dbError = new Error('Timed out acquiring connection from connection pool');
dbError.code = 'PGRST003';
```

But real errors are DataAccessError where .code is undefined and .message is 'Failed to query'. The tests pass because they use fabricated error shapes that don't match production, masking the classifier bug above.
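A fixture closer to production would wrap the PGRST error the way the data-access layer does. DataAccessError below is a stand-in class (the real one lives in base.collection.js); the point is the shape:

```javascript
// Stand-in for the real DataAccessError (an assumption): generic
// message on top, PGRST details nested under .cause.
class DataAccessError extends Error {
  constructor(message, options) {
    super(message, options); // standard Error: options.cause becomes this.cause
    this.name = 'DataAccessError';
  }
}

// Build the error exactly as described in the propagation trace above.
function makePoolTimeoutError() {
  return new DataAccessError('Failed to query', {
    cause: {
      code: 'PGRST003',
      message: 'Timed out acquiring connection from connection pool',
    },
  });
}
```

Tests built on this fixture would have caught the classifier returning false for real pool timeouts.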
Medium: checkSiteRequiresValidation throw is silently caught
In src/index.js:296-310:
```js
try {
  const site = await Site.findById(siteId);
  if (site) {
    const requiresValidation = await checkSiteRequiresValidation(site, context, type);
    site.requiresValidation = requiresValidation;
    context.site = site;
  }
} catch (e) {
  if (!siteId.startsWith('warmup-site-')) {
    log.warn(`Failed to fetch site ${siteId}: ${e.message}`);
  }
}
```

If checkSiteRequiresValidation throws (per the new change), the catch here swallows it. No Lambda retry happens. The audit handler still runs, but context.site is never assigned (the throw happens before lines 303-304). This is a behavioral change: previously the site was always set on context; now on transient errors it won't be. Downstream impact depends on what handlers do when context.site is undefined.
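One hedged way to restore the retry path is a selective rethrow in this catch. The sketch below injects its collaborators as parameters to stay self-contained; in src/index.js they come from imports and closure scope, and loadSiteOntoContext is an illustrative name:

```javascript
// Sketch: rethrow transient errors so Lambda/SQS retries the job,
// keep the warn-and-continue path for everything else.
async function loadSiteOntoContext({
  siteId, context, type, Site, log,
  checkSiteRequiresValidation, isTransientTierClientError,
}) {
  try {
    const site = await Site.findById(siteId);
    if (site) {
      site.requiresValidation = await checkSiteRequiresValidation(site, context, type);
      context.site = site;
    }
  } catch (e) {
    if (isTransientTierClientError(e)) {
      throw e; // propagate so Lambda/SQS can retry
    }
    if (!siteId.startsWith('warmup-site-')) {
      log.warn(`Failed to fetch site ${siteId}: ${e.message}`);
    }
  }
}
```

This preserves the old tolerance for missing or broken sites while letting pool timeouts surface for retry.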
Medium: Outer catch in checkProductCodeEntitlements semantics changed
```js
} catch (error) {
  // Old: return false (cancel job)
  // New: throw error (retry)
  context.log.error('Transient error in entitlement check, job will retry:', error);
  throw error;
}
```

The comment says "a transient error was thrown from inner loop" but any error landing here is now rethrown, including non-transient unexpected errors (e.g., from Promise.all itself). The old code returned false for all of these (cancel). The new code retries all of them. This is probably safer in practice, but the comment is misleading.
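If the intent is to keep the old cancel semantics for unexpected permanent errors, the outer catch could branch on the classifier. This is a sketch; handleEntitlementError is an illustrative name and the classifier is passed in only to keep the example self-contained:

```javascript
// Illustrative helper: only transient errors propagate for retry;
// anything else keeps the old "return false → cancel" behavior.
function handleEntitlementError(error, context, isTransientTierClientError) {
  if (isTransientTierClientError(error)) {
    context.log.error('Transient error in entitlement check, job will retry:', error);
    throw error; // surfaces to Lambda/SQS for retry
  }
  context.log.error('Non-transient entitlement error, treating as not entitled:', error);
  return false; // old behavior: job is cancelled
}
```

Either way, the log message and comment should match what actually reaches that catch.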
Low: HTTP 500 classified as transient
A 500 from PostgREST could indicate a bug (bad query, schema mismatch), not just a transient outage. Retrying until the message lands in the DLQ is low risk for simple TierClient queries, but worth noting.
Infra context
The PostgREST pool is configured as:
- PGRST_DB_POOL=100 per ECS task
- PGRST_DB_POOL_ACQUISITION_TIMEOUT=10 seconds
- 10 prod ECS tasks = 1000 total PostgREST connections
- Auto-scaling is disabled because it "churns connection pools and causes sporadic fetch failed errors"
The 10-second acquisition timeout with auto-scaling disabled means PGRST003 is a real and recurring scenario under load, making it important to get this fix right.
Summary
| Severity | Finding |
|---|---|
| Critical | Classifier checks error.code/.message but real errors are DataAccessError with info nested in .cause — the fix doesn't fix the bug |
| Critical | Tests use fabricated error shapes that don't match production, masking the classifier bug |
| Medium | checkSiteRequiresValidation throws are caught by caller in index.js — no retry, and context.site not set |
| Medium | Outer catch now rethrows ALL errors, not just transient ones |
| Low | HTTP 500 as transient |
The classifier needs to walk the error.cause chain to find PGRST codes and messages, and tests should use DataAccessError with the real nested error shape.