diff --git a/docs/plans/fix-database-timeout-entitlement-check.md b/docs/plans/fix-database-timeout-entitlement-check.md new file mode 100644 index 0000000000..73393e9c6e --- /dev/null +++ b/docs/plans/fix-database-timeout-entitlement-check.md @@ -0,0 +1,277 @@ +# Fix Database Timeout Issue in Entitlement Checking + +## Context + +Preflight job `d272041d-f014-4792-a1a1-944e5c0c57ab` was incorrectly **cancelled** on March 31, 2026 due to a database connection pool timeout (`PGRST003`). The entitlement check in `checkProductCodeEntitlements()` caught the transient infrastructure error and returned `false`, causing the system to treat it as "site not entitled" and set the job status to `CANCELLED`. + +The problem: **All errors are treated equally** - transient infrastructure failures (database timeouts, network issues) are handled the same as legitimate "site not entitled" scenarios. This causes jobs to be cancelled instead of allowing Lambda/SQS to retry. + +### Error Flow from Logs +1. Database: `PGRST003: Timed out acquiring connection from connection pool` +2. TierClient throws error → caught at `/src/common/audit-utils.js:38-40` +3. Returns `false` → `isAuditEnabledForSite()` returns `false` +4. AsyncJobRunner sets `AsyncJob.Status.CANCELLED` (line 91) +5. Job skipped with reason: "preflight audits disabled for site" + +### Expected Behavior +- **Transient errors** (database timeouts, network failures) → Throw error, let Lambda retry +- **Permanent errors** (not entitled, not found) → Return `false`, cancel job gracefully + +## Solution Approach + +**Strategy**: Distinguish transient from permanent errors and **rethrow transient errors** to trigger Lambda/SQS retry mechanism (no local retry logic). + +**Why no local retry?** +- Lambda/SQS provides automatic retries with backoff and DLQ +- Local retry adds complexity and consumes Lambda timeout budget +- Matches existing pattern in codebase (internal-links lets "SQS retry complete finalization") + +## Implementation Plan + +### 1. Create Error Classifier Utility + +**New file**: `/src/common/tier-client-error-classifier.js` + +Create `isTransientTierClientError(error)` function to classify errors: + +**Transient patterns** (should retry): +- Database: `PGRST000`, `PGRST001`, `PGRST002`, `PGRST003`, `connection pool`, `timeout`, `ECONNREFUSED`, `ETIMEDOUT` + - PGRST000: Could not connect with database (503) + - PGRST001: Could not connect due to internal error (503) + - PGRST002: Could not connect when building schema cache (503) + - PGRST003: Timed out acquiring connection from connection pool (504) +- Network: `ENOTFOUND`, `ECONNRESET`, `EAI_AGAIN`, `socket hang up`, `network error` +- HTTP: 408, 429, 500, 502, 503, 504 (if exposed) +- Generic: `temporary failure`, `service unavailable` + +**Permanent patterns** (skip audit): +- HTTP: 401, 403, 404 +- PostgREST: `PGRST100-122` (bad request), `PGRST200-204` (not found), `PGRST300` (config error) +- Business: `not enrolled`, `no entitlement`, `invalid product code` + +Implementation: +```javascript +export function isTransientTierClientError(error) { + const message = error?.message?.toLowerCase() || ''; + const code = error?.code?.toUpperCase() || ''; + const statusCode = error?.statusCode || error?.status; + + // Database errors - PostgREST connection and timeout issues + const transientPostgrestCodes = ['PGRST000', 'PGRST001', 'PGRST002', 'PGRST003']; + if (transientPostgrestCodes.includes(code)) return true; + + if (message.includes('connection pool') || message.includes('timed out')) return true; + + // Network errors + if (['ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND', 'ECONNRESET'].includes(code)) return true; + if (message.includes('network error') || message.includes('socket hang up')) return true; + + // HTTP errors + if ([408, 429, 500, 502, 503, 504].includes(statusCode)) return true; + + // Default to permanent (conservative) + return false; +} +``` + +### 2. Update `checkProductCodeEntitlements()` + +**File**: `/src/common/audit-utils.js` (lines 25-48) + +**Changes**: +1. Import classifier: `import { isTransientTierClientError } from './tier-client-error-classifier.js';` +2. Update inner catch block (line 37-40): + ```javascript + } catch (error) { + if (isTransientTierClientError(error)) { + context.log.error(`Transient error checking entitlement for product ${productCode}, will retry:`, error); + throw new Error(`Transient entitlement check error for ${productCode}: ${error.message}`, { cause: error }); + } + context.log.warn(`Site not entitled for product code ${productCode}:`, error); + return false; + } + ``` +3. Update outer catch block (line 44-47): + ```javascript + } catch (error) { + // If we reach here, a transient error was thrown from inner loop + context.log.error('Transient error in entitlement check, job will retry:', error); + throw error; // Propagate to trigger Lambda retry + } + ``` + +### 3. Update `checkSiteRequiresValidation()` + +**File**: `/src/utils/site-validation.js` (lines 54-65) + +Add error classification: +```javascript +} catch (e) { + if (isTransientTierClientError(e)) { + context?.log?.error?.(`Transient error checking validation requirement for site ${site.getId?.()}: ${e.message}`); + throw new Error(`Transient entitlement check error: ${e.message}`, { cause: e }); + } + context?.log?.warn?.(`Entitlement check failed for site ${site.getId?.()}: ${e.message}`); +} +``` + +### 4. Update `isPaidLLMOCustomer()` + +**File**: `/src/prerender/utils/utils.js` (lines 116-119) + +Add error classification: +```javascript +} catch (e) { + if (isTransientTierClientError(e)) { + log.error(`Transient error checking paid LLMO status for siteId=${site.getId()}: ${e.message}`); + throw new Error(`Transient entitlement check error: ${e.message}`, { cause: e }); + } + log.warn(`Prerender - Failed to check paid LLMO customer status for siteId=${site.getId()}: ${e.message}`); + return false; +} +``` + +### 5. Add Clarifying Comment to AsyncJobRunner + +**File**: `/src/common/async-job-runner.js` (line 89-99) + +Add comment: +```javascript +// Note: CANCELLED status is for sites explicitly not entitled or with audits disabled. +// Transient errors (database timeouts, network issues) will throw and trigger Lambda retry. +if (!(await isAuditEnabledForSite(type, site, context))) { +``` + +## Testing Plan + +### New Test File: `test/common/tier-client-error-classifier.test.js` + +Test cases: +- ✅ PGRST000, PGRST001, PGRST002, PGRST003 → returns true (transient) +- ✅ Connection pool timeout → returns true +- ✅ Network errors (ECONNREFUSED, ETIMEDOUT) → returns true +- ✅ HTTP 429, 500, 502, 503, 504 → returns true +- ✅ HTTP 401, 403, 404 → returns false (permanent) +- ✅ PostgREST permanent errors (PGRST100, PGRST202, PGRST300) → returns false +- ✅ Generic errors → returns false (conservative default) +- ✅ Null/undefined error → returns false + +### Update `test/common/audit-utils.test.js` + +Add test cases after line 295: +- ✅ PGRST003 error throws (doesn't return false) +- ✅ Connection timeout throws +- ✅ Network error throws +- ✅ 404 error returns false (existing behavior preserved) +- ✅ Mixed: one transient, one permanent → throws immediately +- ✅ Mixed: one transient, one succeeds → throws immediately + +### Update `test/utils/site-validation.test.js` + +Add transient error test cases for `checkSiteRequiresValidation()` + +### Update `test/audits/prerender/utils.test.js` + +Add transient error test cases for `isPaidLLMOCustomer()` + +## Verification + +### Unit Tests +```bash +npm test -- test/common/tier-client-error-classifier.test.js +npm test -- test/common/audit-utils.test.js +npm test -- test/utils/site-validation.test.js +npm test -- test/audits/prerender/utils.test.js +``` + +### Integration Test +Mock PGRST003 error and verify: +1. Job throws error (doesn't return false) +2. Job status is NOT set to CANCELLED +3. Lambda retry mechanism triggered +4. Logs contain "Transient error" messages + +### Observability +Query Coralogix for: +```dataprime +source logs +| filter $l.subsystemname == 'spacecat-services-prod' + && $d.message ~ 'Transient error' +| orderby $m.timestamp desc +``` + +## Critical Files + +**Source**: +- `src/common/tier-client-error-classifier.js` (NEW) +- `src/common/audit-utils.js` (MODIFY lines 25-48) +- `src/utils/site-validation.js` (MODIFY lines 54-65) +- `src/prerender/utils/utils.js` (MODIFY lines 116-119) +- `src/common/async-job-runner.js` (ADD COMMENT line 89) + +**Tests**: +- `test/common/tier-client-error-classifier.test.js` (NEW) +- `test/common/audit-utils.test.js` (ADD TESTS after line 295) +- `test/utils/site-validation.test.js` (ADD TESTS) +- `test/audits/prerender/utils.test.js` (ADD TESTS) + +## Backwards Compatibility + +✅ **Preserved**: Sites without entitlements still return `false`, job `CANCELLED` +✅ **Preserved**: Permanent errors (404, not found) still return `false` +⚠️ **Changed**: Transient errors now throw instead of returning `false` + +**Impact**: Jobs previously cancelled due to database timeouts will now retry via Lambda/SQS until successful or DLQ. + +## Scope Analysis: Why This Fix Is Sufficient + +### Investigation: Other Database Operations + +A comprehensive search was conducted to identify if similar issues exist elsewhere in the codebase. + +**Finding**: **TierClient entitlement checks were the only vulnerable area.** + +### Why Other Database Operations Are Safe: + +1. **Configuration.findLatest() calls** (~40+ locations) + - These properly **throw errors** on failure, not return false + - Throwing causes Lambda to fail/retry (correct behavior) + - Example: Multiple handlers use this pattern without fail-safe catching + +2. **Data access utilities** (`src/utils/data-access.js`) + - Functions like `retrieveSiteBySiteId()` and `retrieveAuditById()` throw on error: + ```javascript + } catch (e) { + throw new Error(`Error getting site ${siteId}: ${e.message}`); + } + ``` + - Errors propagate to Lambda for retry + +3. **Most audit handlers** + - Database errors naturally bubble up and cause Lambda failure/retry + - No fail-safe pattern that misinterprets transient errors + +### Why TierClient Was Unique: + +The issue was specific to **pre-audit entitlement gating**: +- Runs in `AsyncJobRunner` **before** audit execution +- Occurs in: `isAuditEnabledForSite()` → `checkProductCodeEntitlements()` → TierClient +- Returning `false` triggers `AsyncJob.Status.CANCELLED` (line 91) +- **Transient infrastructure errors were misinterpreted as "site not entitled"** + +### Conclusion: + +This fix addresses **all vulnerable locations**: +- ✅ `checkProductCodeEntitlements()` - Main entitlement gate +- ✅ `checkSiteRequiresValidation()` - Validation requirement check +- ✅ `isPaidLLMOCustomer()` - LLMO tier check + +**Future Watch**: Any new entitlement checks should use `isTransientTierClientError()` to avoid this pattern. + +## Rollout + +1. Deploy to dev, verify unit tests pass +2. Monitor CloudWatch for "Transient error" logs +3. Verify no false positives (legitimate "not entitled" being retried) +4. Deploy to stage, monitor 24-48 hours +5. Deploy to prod with CloudWatch alarm for DLQ depth diff --git a/src/common/async-job-runner.js b/src/common/async-job-runner.js index c90b5064d4..f19bed657f 100644 --- a/src/common/async-job-runner.js +++ b/src/common/async-job-runner.js @@ -86,6 +86,8 @@ export class AsyncJobRunner extends StepAudit { const site = await this.siteProvider(siteId, context); + // Note: CANCELLED status is for sites explicitly not entitled or with audits disabled. + // Transient errors (database timeouts, network issues) will throw and trigger Lambda retry. if (!(await isAuditEnabledForSite(type, site, context))) { log.debug(`${type} audits disabled for site ${siteId}, skipping...`); job.setStatus(AsyncJob.Status.CANCELLED); diff --git a/src/common/audit-utils.js b/src/common/audit-utils.js index 8922b3fc74..27e19f68ce 100644 --- a/src/common/audit-utils.js +++ b/src/common/audit-utils.js @@ -13,6 +13,7 @@ import { isValidUUID, isNonEmptyArray } from '@adobe/spacecat-shared-utils'; import { TierClient } from '@adobe/spacecat-shared-tier-client'; import { retrieveAuditById } from '../utils/data-access.js'; +import { isTransientTierClientError } from './tier-client-error-classifier.js'; /** * Check if site has site enrollment for any of the product codes @@ -35,15 +36,22 @@ export async function checkProductCodeEntitlements(productCodes, site, context) const tierResult = await tierClient.checkValidEntitlement(); return tierResult.siteEnrollment || false; } catch (error) { - context.log.error(`Failed to check entitlement for product code ${productCode}:`, error); + if (isTransientTierClientError(error)) { + context.log.error(`Transient error checking entitlement for product ${productCode}, will retry:`, error); + throw new Error(`Transient entitlement check error for ${productCode}: ${error.message}`, { cause: error }); + } + context.log.warn(`Site not entitled for product code ${productCode}:`, error); return false; } }), ); return enrollmentChecks.some((hasEnrollment) => hasEnrollment); } catch (error) { - context.log.error('Error checking product code entitlements:', error); - return false; // Fail safe - deny audit if entitlement check fails + // Any error reaching here (transient or unexpected) triggers retry rather than cancellation. + // Inner loop only throws on transient errors, but this also catches unexpected errors + // from Promise.all or the enrollment check itself, which is safer than returning false. + context.log.error('Error in entitlement check, job will retry:', error); + throw error; // Propagate to trigger Lambda retry } } diff --git a/src/common/tier-client-error-classifier.js b/src/common/tier-client-error-classifier.js new file mode 100644 index 0000000000..de5813630c --- /dev/null +++ b/src/common/tier-client-error-classifier.js @@ -0,0 +1,91 @@ +/* + * Copyright 2026 Adobe. All rights reserved. + * This file is licensed to you under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. You may obtain a copy + * of the License at http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software distributed under + * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS + * OF ANY KIND, either express or implied. See the License for the specific language + * governing permissions and limitations under the License. + */ + +/** + * Classifies TierClient errors as transient (should retry) or permanent (site not entitled). + * + * Transient errors indicate temporary infrastructure issues that may resolve on retry: + * - Database connection pool timeouts (PGRST003) + * - Network errors (ECONNREFUSED, ETIMEDOUT, ENOTFOUND, ECONNRESET) + * - HTTP server errors (408, 429, 500, 502, 503, 504) + * + * Permanent errors indicate the site is legitimately not entitled: + * - HTTP client errors (401, 403, 404) + * - Business logic errors (no entitlement, not enrolled, invalid product code) + * + * Walks the error.cause chain since DataAccessError wraps PostgREST errors in .cause property. + * + * @param {Error} error - Error from TierClient.checkValidEntitlement() + * @returns {boolean} - True if transient/retryable, false if permanent + */ +export function isTransientTierClientError(error) { + if (!error) { + return false; + } + + // Walk the error.cause chain to find transient indicators. + // Real DataAccessError from data-access layer wraps PostgREST errors: + // { name: 'DataAccessError', message: 'Failed to query', + // cause: { code: 'PGRST003', message: '...' } } + let current = error; + while (current) { + const message = current?.message?.toLowerCase() || ''; + const code = current?.code?.toUpperCase() || ''; + const statusCode = current?.statusCode || current?.status; + + // Database errors - PostgREST connection and timeout issues + // PGRST000: Could not connect with database (503) + // PGRST001: Could not connect due to internal error (503) + // PGRST002: Could not connect when building schema cache (503) + // PGRST003: Timed out acquiring connection from pool (504) + const transientPostgrestCodes = ['PGRST000', 'PGRST001', 'PGRST002', 'PGRST003']; + if (transientPostgrestCodes.includes(code)) { + return true; + } + + // Database timeout patterns + if (message.includes('connection pool') || message.includes('timed out')) { + return true; + } + + // Network errors + const transientNetworkCodes = ['ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND', 'ECONNRESET']; + if (transientNetworkCodes.includes(code)) { + return true; + } + + // Network error patterns in message + if (message.includes('network error') + || message.includes('socket hang up') + || message.includes('eai_again')) { + return true; + } + + // Transient HTTP status codes + const transientStatusCodes = [408, 429, 500, 502, 503, 504]; + if (statusCode && transientStatusCodes.includes(statusCode)) { + return true; + } + + // Generic transient patterns + if (message.includes('temporary failure') || message.includes('service unavailable')) { + return true; + } + + // Walk to the next error in the cause chain + current = current.cause; + } + + // Default to permanent (conservative approach) + // Permanent errors include: 401, 403, 404, business logic errors, etc. + return false; +} diff --git a/src/index.js b/src/index.js index 72b8110962..1d9001240e 100644 --- a/src/index.js +++ b/src/index.js @@ -16,6 +16,7 @@ import { sqsEventAdapter, logWrapper } from '@adobe/spacecat-shared-utils'; import { internalServerError, notFound, ok } from '@adobe/spacecat-shared-http-utils'; import dataAccess from './support/data-access.js'; import { checkSiteRequiresValidation } from './utils/site-validation.js'; +import { isTransientTierClientError } from './common/tier-client-error-classifier.js'; import sqs from './support/sqs.js'; import s3Client from './support/s3-client.js'; @@ -304,6 +305,12 @@ async function run(message, context) { context.site = site; } } catch (e) { + // Rethrow transient errors to trigger Lambda/SQS retry + if (isTransientTierClientError(e)) { + log.error(`Transient error during site validation check for ${siteId}, triggering retry: ${e.message}`); + throw e; + } + // Log and swallow non-transient errors (e.g., site not found, not entitled) if (!siteId.startsWith('warmup-site-')) { log.warn(`Failed to fetch site ${siteId}: ${e.message}`); } diff --git a/src/prerender/utils/utils.js b/src/prerender/utils/utils.js index 3dbf3837a5..d51f30a2a9 100644 --- a/src/prerender/utils/utils.js +++ b/src/prerender/utils/utils.js @@ -16,6 +16,7 @@ import { Entitlement } from '@adobe/spacecat-shared-data-access'; import { TierClient } from '@adobe/spacecat-shared-tier-client'; +import { isTransientTierClientError } from '../../common/tier-client-error-classifier.js'; /** * Common non-HTML file extensions that should be filtered out @@ -114,6 +115,10 @@ export async function isPaidLLMOCustomer(context) { log.debug(`Prerender - isPaidLLMOCustomer check: siteId=${site.getId()}, tier=${tier}, isPaid=${isPaid}`); return isPaid; } catch (e) { + if (isTransientTierClientError(e)) { + log.error(`Transient error checking paid LLMO status for siteId=${site.getId()}: ${e.message}`); + throw new Error(`Transient entitlement check error: ${e.message}`, { cause: e }); + } log.warn(`Prerender - Failed to check paid LLMO customer status for siteId=${site.getId()}: ${e.message}`); return false; } diff --git a/src/utils/site-validation.js b/src/utils/site-validation.js index d5b47cfcbc..19c82dc236 100644 --- a/src/utils/site-validation.js +++ b/src/utils/site-validation.js @@ -12,6 +12,7 @@ import { TierClient } from '@adobe/spacecat-shared-tier-client'; import { Entitlement } from '@adobe/spacecat-shared-data-access'; +import { isTransientTierClientError } from '../common/tier-client-error-classifier.js'; const ASO_PRODUCT_CODE = Entitlement.PRODUCT_CODES.ASO; @@ -61,6 +62,10 @@ export async function checkSiteRequiresValidation(site, context, auditType) { return true; } } catch (e) { + if (isTransientTierClientError(e)) { + context?.log?.error?.(`Transient error checking validation requirement for site ${site.getId?.()}: ${e.message}`); + throw new Error(`Transient entitlement check error: ${e.message}`, { cause: e }); + } context?.log?.warn?.(`Entitlement check failed for site ${site.getId?.()}: ${e.message}`); } diff --git a/test/audits/prerender/utils.test.js b/test/audits/prerender/utils.test.js index b66f96cfcf..90c1f53e96 100644 --- a/test/audits/prerender/utils.test.js +++ b/test/audits/prerender/utils.test.js @@ -15,9 +15,12 @@ import { expect, use } from 'chai'; import sinon from 'sinon'; import sinonChai from 'sinon-chai'; +import chaiAsPromised from 'chai-as-promised'; import esmock from 'esmock'; +import { DataAccessError } from '@adobe/spacecat-shared-data-access'; use(sinonChai); +use(chaiAsPromised); describe('Prerender Utils', () => { let sandbox; @@ -35,6 +38,7 @@ describe('Prerender Utils', () => { log = { debug: sandbox.stub(), warn: sandbox.stub(), + error: sandbox.stub(), }; mockSite = { @@ -131,27 +135,59 @@ describe('Prerender Utils', () => { ); }); - it('should return false and log warning when TierClient.createForSite fails', async () => { - TierClientStub.createForSite.rejects(new Error('TierClient creation failed')); + it('should return false and log warning when permanent error occurs in createForSite', async () => { + const permanentError = new Error('Not Found'); + permanentError.statusCode = 404; + TierClientStub.createForSite.rejects(permanentError); const context = { site: mockSite, log }; const result = await utils.isPaidLLMOCustomer(context); expect(result).to.be.false; expect(log.warn).to.have.been.calledWith( - sinon.match(/Failed to check paid LLMO customer status.*siteId=test-site-id.*TierClient creation failed/), + sinon.match(/Failed to check paid LLMO customer status.*siteId=test-site-id.*Not Found/), ); }); - it('should return false and log warning when checkValidEntitlement fails', async () => { - mockTierClient.checkValidEntitlement.rejects(new Error('Entitlement check failed')); + it('should throw when transient PGRST003 error occurs in createForSite', async () => { + const pgrstError = { + code: 'PGRST003', + message: 'Timed out acquiring connection from connection pool', + }; + const dbError = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + TierClientStub.createForSite.rejects(dbError); + + const context = { site: mockSite, log }; + + await expect(utils.isPaidLLMOCustomer(context)) + .to.be.rejectedWith('Transient entitlement check error'); + expect(log.error).to.have.been.calledWith( + sinon.match(/Transient error checking paid LLMO status.*siteId=test-site-id/), + ); + }); + + it('should throw when transient network error occurs', async () => { + const networkCause = { message: 'Network error occurred' }; + const networkError = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkCause); + TierClientStub.createForSite.rejects(networkError); + + const context = { site: mockSite, log }; + + await expect(utils.isPaidLLMOCustomer(context)) + .to.be.rejectedWith('Transient entitlement check error'); + expect(log.error).to.have.been.called; + }); + + it('should return false and log warning when permanent error occurs in checkValidEntitlement', async () => { + const permanentError = new Error('No entitlement found'); + mockTierClient.checkValidEntitlement.rejects(permanentError); const context = { site: mockSite, log }; const result = await utils.isPaidLLMOCustomer(context); expect(result).to.be.false; expect(log.warn).to.have.been.calledWith( - sinon.match(/Failed to check paid LLMO customer status.*siteId=test-site-id.*Entitlement check failed/), + sinon.match(/Failed to check paid LLMO customer status.*siteId=test-site-id.*No entitlement found/), ); }); diff --git a/test/common/audit-utils.test.js b/test/common/audit-utils.test.js index e6f3d7f946..62f6d14e11 100644 --- a/test/common/audit-utils.test.js +++ b/test/common/audit-utils.test.js @@ -18,6 +18,7 @@ import sinonChai from 'sinon-chai'; import chaiAsPromised from 'chai-as-promised'; import { TierClient } from '@adobe/spacecat-shared-tier-client'; +import { DataAccessError } from '@adobe/spacecat-shared-data-access'; import { isAuditEnabledForSite, loadExistingAudit, @@ -285,13 +286,102 @@ describe('Audit Utils Tests', () => { expect(result).to.be.true; }); - it('handles critical error in checkProductCodeEntitlements and returns false', async () => { + it('handles critical error in checkProductCodeEntitlements and throws', async () => { // Mock Promise.all to throw an error to trigger the outer try-catch block sandbox.stub(Promise, 'all').throws(new Error('Critical Promise.all error')); - const result = await checkProductCodeEntitlements(['ASO', 'LLMO'], site, context); + await expect(checkProductCodeEntitlements(['ASO', 'LLMO'], site, context)) + .to.be.rejectedWith('Critical Promise.all error'); + expect(context.log.error).to.have.been.calledWith('Error in entitlement check, job will retry:', sinon.match.instanceOf(Error)); + }); + + it('throws when PGRST003 database timeout occurs', async () => { + const pgrstError = { + code: 'PGRST003', + message: 'Timed out acquiring connection from connection pool', + }; + const dbError = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + const mockTierClient = { + checkValidEntitlement: sandbox.stub().rejects(dbError), + }; + sandbox.stub(TierClient, 'createForSite').returns(mockTierClient); + + await expect(checkProductCodeEntitlements(['ASO'], site, context)) + .to.be.rejectedWith('Transient entitlement check error for ASO'); + expect(context.log.error).to.have.been.calledWith( + 'Transient error checking entitlement for product ASO, will retry:', + dbError, + ); + }); + + it('throws when connection timeout occurs', async () => { + const networkCause = { code: 'ETIMEDOUT', message: 'Connection timed out' }; + const timeoutError = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkCause); + const mockTierClient = { + checkValidEntitlement: sandbox.stub().rejects(timeoutError), + }; + sandbox.stub(TierClient, 'createForSite').returns(mockTierClient); + + await expect(checkProductCodeEntitlements(['ASO'], site, context)) + .to.be.rejectedWith('Transient entitlement check error for ASO'); + }); + + it('throws when network error occurs', async () => { + const networkCause = { message: 'Network error occurred' }; + const networkError = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkCause); + const mockTierClient = { + checkValidEntitlement: sandbox.stub().rejects(networkError), + }; + sandbox.stub(TierClient, 'createForSite').returns(mockTierClient); + + await expect(checkProductCodeEntitlements(['ASO'], site, context)) + .to.be.rejectedWith('Transient entitlement check error for ASO'); + }); + + it('returns false when permanent error (404) occurs', async () => { + const httpCause = { statusCode: 404, message: 'Not Found' }; + const notFoundError = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpCause); + const mockTierClient = { + checkValidEntitlement: sandbox.stub().rejects(notFoundError), + }; + sandbox.stub(TierClient, 'createForSite').returns(mockTierClient); + + const result = await checkProductCodeEntitlements(['ASO'], site, context); expect(result).to.be.false; - expect(context.log.error).to.have.been.calledWith('Error checking product code entitlements:', sinon.match.instanceOf(Error)); + expect(context.log.warn).to.have.been.calledWith( + 'Site not entitled for product code ASO:', + notFoundError, + ); + }); + + it('throws immediately when transient error occurs even with multiple product codes', async () => { + const dbError = new Error('Connection pool exhausted'); + const mockTierClient = { + checkValidEntitlement: sandbox.stub() + .onFirstCall() + .rejects(dbError) + .onSecondCall() + .resolves({ siteEnrollment: mockSiteEnrollment }), + }; + sandbox.stub(TierClient, 'createForSite').returns(mockTierClient); + + await expect(checkProductCodeEntitlements(['ASO', 'LLMO'], site, context)) + .to.be.rejectedWith('Transient entitlement check error for ASO'); + }); + + it('returns true when one product has permanent error but another succeeds', async () => { + const notFoundError = new Error('Not enrolled'); + const mockTierClient = { + checkValidEntitlement: sandbox.stub() + .onFirstCall() + .rejects(notFoundError) + .onSecondCall() + .resolves({ siteEnrollment: mockSiteEnrollment }), + }; + sandbox.stub(TierClient, 'createForSite').returns(mockTierClient); + + const result = await checkProductCodeEntitlements(['ASO', 'LLMO'], site, context); + expect(result).to.be.true; }); }); diff --git a/test/common/index-site-validation.test.js b/test/common/index-site-validation.test.js index 0ca7ddddc0..89a00a9487 100644 --- a/test/common/index-site-validation.test.js +++ b/test/common/index-site-validation.test.js @@ -90,4 +90,26 @@ describe('Index siteId handling and validation flag', () => { // Site should not be set on context since fetch failed expect(context.site).to.be.undefined; }); + + it('rethrows transient entitlement errors during site validation to trigger retry', async () => { + const transientError = new Error('tier service unavailable'); + transientError.statusCode = 503; + + sandbox.stub(TierClient, 'createForSite').returns({ + checkValidEntitlement: sandbox.stub().rejects(transientError), + }); + + let thrown; + try { + await main(new Request('https://space.cat'), context); + } catch (e) { + thrown = e; + } + + expect(thrown).to.be.an('error'); + expect(thrown.message).to.equal('Transient entitlement check error: tier service unavailable'); + expect(context.log.error).to.have.been.calledWith( + 'Transient error during site validation check for site-xyz, triggering retry: Transient entitlement check error: tier service unavailable', + ); + }); }); diff --git a/test/common/tier-client-error-classifier.test.js b/test/common/tier-client-error-classifier.test.js new file mode 100644 index 0000000000..df359d0ea0 --- /dev/null +++ b/test/common/tier-client-error-classifier.test.js @@ -0,0 +1,281 @@ +/* + * Copyright 2026 Adobe. All rights reserved. + * This file is licensed to you under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. You may obtain a copy + * of the License at http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software distributed under + * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS + * OF ANY KIND, either express or implied. See the License for the specific language + * governing permissions and limitations under the License. + */ + +/* eslint-env mocha */ + +import { expect } from 'chai'; +import { DataAccessError } from '@adobe/spacecat-shared-data-access'; +import { isTransientTierClientError } from '../../src/common/tier-client-error-classifier.js'; + +describe('tier-client-error-classifier', () => { + describe('isTransientTierClientError', () => { + describe('transient database errors', () => { + it('returns true for PGRST000 error code (connection error)', () => { + const pgrstError = { code: 'PGRST000', message: 'Could not connect with the database' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for PGRST001 error code (internal connection error)', () => { + const pgrstError = { code: 'PGRST001', message: 'Could not connect due to internal error' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for PGRST002 error code (schema cache connection error)', () => { + const pgrstError = { code: 'PGRST002', message: 'Could not connect when building schema cache' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for PGRST003 error code (pool timeout)', () => { + const pgrstError = { + code: 'PGRST003', + message: 'Timed out acquiring connection from connection pool', + details: 'Waited 10 seconds', + hint: 'Increase pool size or timeout', + }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for connection pool timeout message in cause', () => { + const pgrstError = { + code: 'PGRST003', + message: 'Timed out acquiring connection from connection pool', + }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for generic timeout message in cause', () => { + const pgrstError = { message: 'Operation timed out' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for connection pool message (case insensitive) in cause', () => { + const pgrstError = { message: 'CONNECTION POOL exhausted' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.true; + }); + }); + + describe('transient network errors', () => { + it('returns true for ECONNREFUSED', () => { + const networkError = { code: 'ECONNREFUSED', message: 'Connection refused' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for ETIMEDOUT', () => { + const networkError = { code: 'ETIMEDOUT', message: 'Connection timed out' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for ENOTFOUND', () => { + const networkError = { code: 'ENOTFOUND', message: 'DNS lookup failed' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for ECONNRESET', () => { + const networkError = { code: 'ECONNRESET', message: 'Connection reset by peer' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for network error message in cause', () => { + const networkError = { message: 'Network error occurred' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for socket hang up message in cause', () => { + const networkError = { message: 'socket hang up' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for EAI_AGAIN message in cause', () => { + const networkError = { message: 'getaddrinfo EAI_AGAIN' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkError); + expect(isTransientTierClientError(error)).to.be.true; + }); + }); + + describe('transient HTTP errors', () => { + it('returns true for HTTP 408 (Request Timeout) in cause', () => { + const httpError = { statusCode: 408, message: 'Request Timeout' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for HTTP 429 (Too Many Requests) in cause', () => { + const httpError = { statusCode: 429, message: 'Too Many Requests' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for HTTP 500 (Internal Server Error) in cause', () => { + const httpError = { statusCode: 500, message: 'Internal Server Error' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for HTTP 502 (Bad Gateway) in cause', () => { + const httpError = { statusCode: 502, message: 'Bad Gateway' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for HTTP 503 (Service Unavailable) in cause', () => { + const httpError = { statusCode: 503, message: 'Service Unavailable' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for HTTP 504 (Gateway Timeout) in cause', () => { + const httpError = { statusCode: 504, message: 'Gateway Timeout' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for status property instead of statusCode in cause', () => { + const httpError = { status: 504, message: 'Gateway Timeout' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.true; + }); + }); + + describe('transient generic patterns', () => { + it('returns true for temporary failure message in cause', () => { + const innerError = { message: 'Temporary failure in name resolution' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, innerError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns true for service unavailable message in cause', () => { + const innerError = { message: 'Service unavailable, please try again' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, innerError); + expect(isTransientTierClientError(error)).to.be.true; + }); + }); + + describe('permanent errors', () => { + it('returns false for HTTP 401 (Unauthorized)', () => { + const httpError = { statusCode: 401, message: 'Unauthorized' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for HTTP 403 (Forbidden)', () => { + const httpError = { statusCode: 403, message: 'Forbidden' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for HTTP 404 (Not Found)', () => { + const httpError = { statusCode: 404, message: 'Not Found' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, httpError); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for not enrolled message (business logic error)', () => { + const error = new Error('Site not enrolled for product code ASO'); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for no entitlement message (business logic error)', () => { + const error = new Error('No entitlement found'); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for invalid product code message (business logic error)', () => { + const error = new Error('Invalid product code'); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for generic error', () => { + const error = new Error('Something went wrong'); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for DataAccessError with generic message and no cause', () => { + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for PGRST100 (bad request - parsing error)', () => { + const pgrstError = { code: 'PGRST100', message: 'Parsing error in query string' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for PGRST202 (function not found)', () => { + const pgrstError = { code: 'PGRST202', message: 'Function not found' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for PGRST300 (configuration error)', () => { + const pgrstError = { code: 'PGRST300', message: 'JWT secret missing' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + expect(isTransientTierClientError(error)).to.be.false; + }); + }); + + describe('edge cases', () => { + it('returns false for null error', () => { + expect(isTransientTierClientError(null)).to.be.false; + }); + + it('returns false for undefined error', () => { + expect(isTransientTierClientError(undefined)).to.be.false; + }); + + it('returns false for error with no message', () => { + const error = new Error(); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('returns false for error with empty message', () => { + const error = new Error(''); + expect(isTransientTierClientError(error)).to.be.false; + }); + + it('handles lowercase error codes in cause', () => { + const networkError = { code: 'econnrefused', message: 'Connection refused' }; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('walks multiple levels of cause chain', () => { + // Create a deeply nested error chain + const rootCause = { code: 'PGRST003', message: 'Timed out acquiring connection' }; + const middleError = new Error('Database operation failed'); + middleError.cause = rootCause; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, middleError); + expect(isTransientTierClientError(error)).to.be.true; + }); + + it('returns false when no level in cause chain is transient', () => { + const rootCause = { code: 'PGRST100', message: 'Bad request' }; + const middleError = new Error('Invalid query'); + middleError.cause = rootCause; + const error = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, middleError); + expect(isTransientTierClientError(error)).to.be.false; + }); + }); + }); +}); diff --git a/test/utils/site-validation.test.js b/test/utils/site-validation.test.js index 707dffe61f..a2df5a7223 100644 --- a/test/utils/site-validation.test.js +++ b/test/utils/site-validation.test.js @@ -15,10 +15,13 @@ import { expect, use } from 'chai'; import sinon from 'sinon'; import sinonChai from 'sinon-chai'; +import chaiAsPromised from 'chai-as-promised'; import { TierClient } from '@adobe/spacecat-shared-tier-client'; +import { DataAccessError } from '@adobe/spacecat-shared-data-access'; import { checkSiteRequiresValidation, IS_LLMO_OPPTY } from '../../src/utils/site-validation.js'; use(sinonChai); +use(chaiAsPromised); describe('utils/site-validation', () => { let sandbox; @@ -114,9 +117,11 @@ describe('utils/site-validation', () => { expect(result).to.equal(true); }); - it('returns false and logs warn when entitlement check throws', async () => { + it('returns false and logs warn when permanent error occurs', async () => { const site = { getId: sandbox.stub().returns('site-err') }; - sandbox.stub(TierClient, 'createForSite').rejects(new Error('boom')); + const permanentError = new Error('Not Found'); + permanentError.statusCode = 404; + sandbox.stub(TierClient, 'createForSite').rejects(permanentError); const result = await Promise.resolve(checkSiteRequiresValidation(site, context)); @@ -124,6 +129,33 @@ describe('utils/site-validation', () => { expect(context.log.warn).to.have.been.called; }); + it('throws when transient PGRST003 error occurs', async () => { + const site = { getId: sandbox.stub().returns('site-transient') }; + const pgrstError = { + code: 'PGRST003', + message: 'Timed out acquiring connection from connection pool', + }; + const dbError = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, pgrstError); + context.log.error = sandbox.spy(); + sandbox.stub(TierClient, 'createForSite').rejects(dbError); + + await expect(checkSiteRequiresValidation(site, context)) + .to.be.rejectedWith(Error, 'Transient entitlement check error'); + expect(context.log.error).to.have.been.called; + }); + + it('throws when transient network error occurs', async () => { + const site = { getId: sandbox.stub().returns('site-network-err') }; + const networkCause = { message: 'Network error occurred' }; + const networkError = new DataAccessError('Failed to query', { entityName: 'Entitlement' }, networkCause); + context.log.error = sandbox.spy(); + sandbox.stub(TierClient, 'createForSite').rejects(networkError); + + await expect(checkSiteRequiresValidation(site, context)) + .to.be.rejectedWith(Error, 'Transient entitlement check error'); + expect(context.log.error).to.have.been.called; + }); + // Removed legacy fallback tests since validation is entitlement-driven only it('returns false when no entitlement and site is not included', async () => { @@ -137,9 +169,10 @@ describe('utils/site-validation', () => { expect(result).to.equal(false); }); - it('logs warn when entitlement check throws', async () => { + it('logs warn when permanent error throws', async () => { const site = { getId: sandbox.stub().returns('site-warn-info') }; - sandbox.stub(TierClient, 'createForSite').rejects(new Error('boom')); + const permanentError = new Error('No entitlement found'); + sandbox.stub(TierClient, 'createForSite').rejects(permanentError); context.log.debug = sandbox.spy(); const result = await Promise.resolve(checkSiteRequiresValidation(site, context));