
fix(core): add per-request timeout to Requester to catch silent fetch hangs#579

Merged
d-klotz merged 12 commits into next from claude/core-requester-timeout on Apr 27, 2026

Conversation

Contributor

@d-klotz d-klotz commented Apr 21, 2026

Problem

Requester._request calls this.fetch(url, options) with no AbortController and no timeout. A silently-hung TCP connection (server accepts but never responds, keeps the socket alive) makes the returned promise never resolve. At batch scale the blast radius is catastrophic:

  • A single hung request stalls its Promise.allSettled batch forever.
  • That stalls the whole sync direction.
  • That stalls Promise.all([inbound, outbound]) in any bidirectional orchestration.
  • The worker lambda burns its entire budget (15 minutes by default for Frigg queue workers) waiting on a dead socket.

Reproduced in production against Clockwork Recruiting: individual POST /people/:id/web_sites and /email_addresses calls hung indefinitely while other requests responded normally. 33 such hangs surfaced in a single sync run once an adapter-level workaround was in place. Every Frigg integration using the framework's requester is exposed to this — not a Clockwork-specific issue.
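The stall chain is easy to reproduce in isolation. A minimal, runnable sketch (the helper name is invented for illustration): `Promise.allSettled` never settles while even one member promise stays pending, which is exactly what a silently-hung fetch produces.

```javascript
// One hung promise is enough to stall an entire allSettled batch.
// raceWithFlag resolves 'settled' if the batch finishes within `ms`,
// else 'stalled' — standing in for a worker-lambda deadline.
function raceWithFlag(batch, ms) {
    return Promise.race([
        Promise.allSettled(batch).then(() => 'settled'),
        new Promise((resolve) => setTimeout(() => resolve('stalled'), ms)),
    ]);
}
```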

Fix

  • New constructor parameter requestTimeoutMs with precedence:

    1. Instance param: new Api({ requestTimeoutMs: 30_000 })
    2. Class-level static: static requestTimeoutMs = 30_000
    3. Default: 60_000ms (60s)

    Pass 0 or null to disable (reserved for test doubles and documented long-running endpoints).

  • Fresh AbortController per attempt — the existing retry recursion uses its own backoff sleeps, so each _request invocation gets a clean signal. Timer cleared in finally regardless of outcome.

  • On timeout: throw a FetchError with isTimeout: true and timeoutMs: <N> properties so callers can distinguish timeouts from generic network errors without parsing the sanitized error message (FetchError strips body text outside STAGE=dev).

  • No retry on timeout: a slow endpoint is a downstream problem; each retry would wait another timeoutMs before giving up, amplifying the hang into a per-record multi-minute stall at batch scale. ECONNRESET retry path is unchanged.
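A minimal sketch of the precedence resolution and per-attempt abort wiring described above (names are illustrative, not the exact requester internals; the real `_request` also handles retries and FetchError construction):

```javascript
const DEFAULT_REQUEST_TIMEOUT_MS = 60_000;

// Precedence: instance param → class-level static → default.
// 0 / null explicitly disable the timeout.
function resolveTimeoutMs(instanceTimeout, klass) {
    if (instanceTimeout !== undefined) return instanceTimeout;
    if (klass.requestTimeoutMs !== undefined) return klass.requestTimeoutMs;
    return DEFAULT_REQUEST_TIMEOUT_MS;
}

async function fetchWithTimeout(fetchImpl, url, options, timeoutMs) {
    if (!timeoutMs) return fetchImpl(url, options); // 0/null → no timeout
    const controller = new AbortController();       // fresh per attempt
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
        return await fetchImpl(url, { ...options, signal: controller.signal });
    } catch (e) {
        if (e.name === 'AbortError') {
            const err = new Error(`Request timed out after ${timeoutMs}ms`);
            err.isTimeout = true;   // machine-readable timeout fields
            err.timeoutMs = timeoutMs;
            throw err;
        }
        throw e;                    // ECONNRESET etc. pass through untouched
    } finally {
        clearTimeout(timer);        // cleared regardless of outcome
    }
}
```

An Api subclass could then opt in with `static requestTimeoutMs = 30_000;`.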

Default choice (60s)

Conservative enough not to regress any integration relying on slow-but-legitimate calls (OAuth refresh to enterprise IdPs, bulk exports). Integrations that want a tighter timeout can opt in per class. Clockwork specifically will use static requestTimeoutMs = 30_000 post-merge.

Test plan

  • 14 new unit tests covering config precedence, abort behavior, no-retry semantics, requestTimeoutMs=0 disabled path, signal wiring, and a regression guard on the existing ECONNRESET retry.
  • Tests use jest fake timers — full suite runs in sub-second real time.
  • packages/core/modules/ suite: 74/74 passing (2 pre-existing skips are live-API tests).
  • No changes to parsedBody, _get/_post/_patch/_put/_delete, or any public method signature.

Adopter impact

  • Default 60s is generous. I checked the sibling integrations (hubspot, clockwork, attio, quo) and none have documented long-running calls beyond OAuth refresh, which completes well under 60s.
  • No code changes required for existing integrations. If any downstream wants to tighten (or loosen) the timeout, they set the static on their Api class.
  • Once this ships in a canary, the clockwork-recruiting--frigg app can delete its adapter-level withTimeout workaround — the core fix supersedes it.

🤖 Generated with Claude Code

📦 Published PR as canary version: 2.0.0--canary.579.2d1eba8.0

✨ Test out this PR locally via:

npm install @friggframework/core@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/devtools@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/eslint-config@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/prettier-config@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/schemas@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/serverless-plugin@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/test@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/ui@2.0.0--canary.579.2d1eba8.0
# or 
yarn add @friggframework/core@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/devtools@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/eslint-config@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/prettier-config@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/schemas@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/serverless-plugin@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/test@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/ui@2.0.0--canary.579.2d1eba8.0

… hangs

Closes the infinite-wait hazard in Requester._request. The framework
previously called `fetch()` with no AbortController and no timeout — a
silently-hung TCP connection (server accepts but never responds, keeps
the socket alive) made the returned promise never resolve. At batch
scale (Promise.allSettled over N records) a single hung request stalled
the whole batch, which stalled the sync direction, which stalled
Promise.all in any bidirectional orchestration, and ultimately ate the
full worker-lambda budget.

This reproduced in production against Clockwork Recruiting — individual
POST /people/:id/web_sites and /email_addresses calls hung indefinitely
while other requests responded normally. 33 such hangs surfaced in a
single sync run once the workaround was in place.

Change
- New constructor parameter `requestTimeoutMs` with precedence:
  instance param → class-level static → 60_000ms default. Pass 0 / null
  to disable.
- `_request` wires a fresh AbortController per attempt, passes its
  signal to fetch, clears the timer in finally. Fresh controller per
  attempt so the existing retry recursion (with its own backoff sleeps)
  always gets a clean signal.
- On AbortError: throw a FetchError with `isTimeout: true` and
  `timeoutMs: <N>` properties so callers can distinguish timeouts from
  generic network errors without parsing the sanitized error message.
- No retry on timeout: a slow endpoint is a downstream problem and each
  retry would wait another `timeoutMs` before giving up, amplifying the
  hang into a per-record multi-minute stall at batch scale. ECONNRESET
  retry path is unchanged.
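A hypothetical caller-side helper (not part of this change; the name is invented) showing the branching the machine-readable fields enable without parsing the sanitized error message:

```javascript
// Classify a requester failure using the isTimeout / timeoutMs fields
// rather than the (sanitized) message text.
function classifyRequesterError(err) {
    if (err.isTimeout) return { kind: 'timeout', timeoutMs: err.timeoutMs };
    if (err.code === 'ECONNRESET') return { kind: 'connection-reset' };
    return { kind: 'fetch-error' };
}
```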

Tests
- 14 new tests covering config precedence, abort behavior, no-retry
  semantics, disabled-timeout path, signal wiring, and a regression
  guard on the existing ECONNRESET retry. Uses jest fake timers so the
  suite runs in sub-second real time.

Adopter impact
- Default of 60s is generous (most OAuth refresh + API calls fit
  comfortably). Integrations that want a tighter timeout can opt in via
  `static requestTimeoutMs = 30_000` on the Api subclass.
- The clockwork-recruiting--frigg app can now delete its adapter-level
  timeout workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-klotz d-klotz added the release Create a release when this pr is merged label Apr 21, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f65577ef6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread: packages/core/modules/requester/requester.js (outdated)
Addresses PR #579 review comment r3116786035 (Codex bot).

node-fetch v2 resolves the fetch() promise when response headers arrive,
not when the body is fully read. The original patch cleared the abort
timer in a finally block immediately after `await this.fetch(...)`
returned — so parsedBody(response) and FetchError.create()'s
response.text() call were both unprotected from a server that sent
headers and then stalled the body stream.

Scoped the timer's lifetime to the whole _request attempt:
- Extracted a clearRequestTimer() helper with idempotent semantics.
- Wrap the full method body in try / catch / finally. The finally now
  clears the timer regardless of whether the fetch, the status check,
  the body read, or the FetchError construction completes.
- Before each recursive retry (ECONNRESET, 429/5xx, 401 refresh), clear
  the current attempt's timer explicitly so the new attempt starts with
  its own fresh timer.
- Added _maybeFlagTimeoutDuringBodyRead() to surface the isTimeout /
  timeoutMs fields even when the AbortError fires mid-body-read (the
  body stream in node-fetch v2 rejects with AbortError on the 'error'
  event when the signal fires). Callers can keep using the same flag
  for both header-phase and body-phase timeouts.

Tests: added a 15th case that sends headers, then stalls inside
json()/text(), and verifies the timeout still catches it with
isTimeout=true / timeoutMs=100. Full suite still runs in sub-second real
time on fake timers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

d-klotz commented Apr 21, 2026

Code review

Found 1 issue (same as r3116786035 from the Codex reviewer, now fixed in commit 312fe8b):

  1. Timeout cleared before body consumption — node-fetch v2 resolves fetch() on headers-received, not on body-fully-read. parsedBody(response) and FetchError.create()'s response.text() read the body after the timer was cleared, so a server that sends headers and then stalls the body stream was still unprotected.

let response;
try {
    response = await this.fetch(encodedUrl, fetchOptions);
} catch (e) {
    // AbortController fires AbortError (name) / ETIMEDOUT-shaped
    // errors (type on node-fetch) when we hit the timeout. No
    // retry on timeout: a slow endpoint is a downstream problem,
    // and each retry would wait another `timeoutMs` before giving
    // up — amplifying the hang into a per-record multi-minute
    // stall at batch scale.
    const isTimeout =
        e?.name === 'AbortError' || e?.type === 'aborted';
    if (e?.code === 'ECONNRESET' && i < this.backOff.length) {
        const delay = this.backOff[i] * 1000;
        await new Promise((resolve) => setTimeout(resolve, delay));
        return this._request(url, options, i + 1);
    }
    const fetchError = await FetchError.create({
        resource: encodedUrl,
        init: options,
        responseBody: isTimeout
            ? `Request timed out after ${timeoutMs}ms`
            : e,
    });
    if (isTimeout) {
        // Flag + machine-readable fields so callers can
        // distinguish a timeout from a generic network error
        // without parsing the message (which FetchError
        // sanitizes outside of STAGE=dev).
        fetchError.isTimeout = true;
        fetchError.timeoutMs = timeoutMs;
    }
    throw fetchError;
} finally {
    if (timeoutHandle) clearTimeout(timeoutHandle);
}

Fixed by scoping the abort timer to the entire _request attempt: wrapped the whole method body in try/catch/finally, added a clearRequestTimer() helper that runs once, and cleared explicitly before each recursive retry (ECONNRESET, 429/5xx, 401 refresh) so each attempt starts with a fresh timer. Also added _maybeFlagTimeoutDuringBodyRead so callers get the same isTimeout / timeoutMs fields whether the abort fires during the header phase or mid-body-stream. New test case covers the headers-then-stalled-body scenario.
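The attempt-scoped timer can be sketched as follows (`clearRequestTimer` follows the commit's naming; the body is illustrative, not the exact requester code):

```javascript
// Timer lives for the WHOLE attempt, not just the header phase.
async function requestAttempt(fetchImpl, url, options, timeoutMs) {
    const controller = new AbortController();
    let timer = setTimeout(() => controller.abort(), timeoutMs);
    const clearRequestTimer = () => {
        // Idempotent: safe to call before a retry AND again in finally.
        if (timer) { clearTimeout(timer); timer = null; }
    };
    try {
        const response = await fetchImpl(url, { ...options, signal: controller.signal });
        // The body read happens INSIDE the protected region: node-fetch v2
        // resolves fetch() when headers arrive, so json()/text() can still stall.
        return await response.json();
    } finally {
        clearRequestTimer(); // covers fetch, status check, and body read alike
    }
}
```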

Other observations checked and ruled out as false positives:

  • Claim that node-fetch v2 ignores AbortSignal — debunked; v2.6.7 (the version shipped via packages/core/node_modules/node-fetch) honors signal and rejects with AbortError both during the request and on body stream errors (see node_modules/node-fetch/lib/index.js:394-400).
  • options.headers mutation across retries — pre-existing behavior outside the PR's scope.

🤖 Generated with Claude Code


SonarQube Cloud flagged three style issues on PR #579:
- javascript:S1186 (MAJOR) × 2: `hangingFetch` and `okFetch` were declared
  inside the `describe('AbortController timeout behavior')` block; hoist
  them to module scope so each describe can reuse them without the
  nested-function cost.
- javascript:S3799 (MINOR): `okFetch(body = { ok: true })` uses an object
  literal as a default parameter. Replace with an `undefined`-check that
  picks up the default inside the body, avoiding the fresh-object-per-call
  allocation pattern the rule guards against.
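The S3799 fix could look like this (a sketch; `okFetch` here is a stand-in test double, not the exact suite code):

```javascript
// Resolve the default inside the body instead of using an object
// literal as a default parameter value.
function okFetch(body) {
    const resolvedBody = body === undefined ? { ok: true } : body;
    return Promise.resolve({
        status: 200,
        json: async () => resolvedBody,
    });
}
```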

No functional change; all 15 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

d-klotz commented Apr 21, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2d1eba87eb


Comment on lines +37 to +40
instanceTimeout !== undefined && instanceTimeout !== null
    ? instanceTimeout
    : this.constructor.requestTimeoutMs ??
      DEFAULT_REQUEST_TIMEOUT_MS;


P2 Badge Treat null timeout override as disabled

The constructor currently treats requestTimeoutMs: null as “use class/default timeout” because null is excluded by instanceTimeout !== undefined && instanceTimeout !== null. That contradicts the new in-code API note saying null disables timeouts, so integrations passing null from config to opt out will still time out unexpectedly. Please align the condition with the documented behavior (or update docs/tests if fallback is intended).


// If the abort fired during body consumption, node-fetch emits
// the error as an AbortError on the body stream. Surface the
// same isTimeout flag callers use for header-phase timeouts.
throw this._maybeFlagTimeoutDuringBodyRead(e, timeoutMs);


P2 Badge Wrap body-read timeout aborts as FetchError

If a 4xx/5xx response stalls while FetchError.create is reading response.text(), the abort surfaces as an AbortError and the outer catch rethrows it directly after tagging timeout fields. That means timeout failures in this path no longer have the FetchError shape/status context that other requester failures expose, which can break callers relying on FetchError handling. Consider converting this abort into a FetchError consistently.


d-klotz and others added 3 commits April 21, 2026 11:12
…lStack

LocalStack queues were being created with only `QueueName`, dropping
every CloudFormation `Properties` field. AWS (via CloudFormation)
applies the full Properties block — notably `VisibilityTimeout: 1800`
that integration-builder emits for integration queues — so local
emulation diverges from production.

Observed impact: the HubspotQueue queue worker's handler runs for
minutes (initial sync of a real firm), but SQS's default 30 s
VisibilityTimeout makes the message re-appear as visible every 30 s.
The same message gets delivered to a second worker invocation while
the first is still processing, which breaks the "one worker per
message" invariant Frigg assumes.

Change
- `extractQueueDefinitions` now looks up each queue's matching
  `AWS::SQS::Queue` resource in `serverless.service.resources.Resources`
  and attaches its `Properties` to the queue definition. If the
  resource is missing, non-SQS, or has no Properties, the behavior is
  unchanged (back-compat).
- `LocalStackQueueService.createQueue` accepts an optional
  `attributes` argument and forwards it to SQS `CreateQueue`. When
  `attributes` is missing/empty the call shape matches the previous
  `{QueueName}`-only behavior.
- `LocalStackQueueService._propertiesToAttributes` whitelists the 15
  CloudFormation Properties that map 1:1 onto SQS CreateQueue
  Attributes and drops everything else (Tags, QueueName, etc.). Object
  values like `RedrivePolicy` get JSON-encoded as AWS expects.
- `createQueues` wires the two together.
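A sketch of the Properties-to-Attributes mapping under these assumptions (the whitelist below is only a subset of the 15 CloudFormation keys the commit mentions; names are illustrative):

```javascript
// CloudFormation Properties that map 1:1 onto SQS CreateQueue Attributes
// (partial whitelist for illustration).
const SQS_ATTRIBUTE_WHITELIST = new Set([
    'VisibilityTimeout', 'MessageRetentionPeriod', 'DelaySeconds',
    'RedrivePolicy', 'FifoQueue', 'ContentBasedDeduplication',
]);

function propertiesToAttributes(properties = {}) {
    const attributes = {};
    for (const [key, value] of Object.entries(properties)) {
        if (!SQS_ATTRIBUTE_WHITELIST.has(key)) continue; // drops Tags, QueueName, …
        // Object values like RedrivePolicy get JSON-encoded, as AWS expects;
        // scalars become strings (SQS attributes are string-typed).
        attributes[key] =
            typeof value === 'object' ? JSON.stringify(value) : String(value);
    }
    return attributes;
}
```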

Tests: 39/39 pass (9 new cases covering the Properties-forwarding path,
back-compat for definitions without resources, and the property-to-
attribute mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rom queue attributes

Addresses Codex review P1 on PR #580.

Serializing every CloudFormation `Properties` value with
`JSON.stringify` would forward unresolved CloudFormation intrinsics
(`Fn::GetAtt`, `Ref`, `Fn::Sub`, …) into LocalStack's `CreateQueue`
call. The concrete case: integration-builder.js emits

  RedrivePolicy: {
    maxReceiveCount: 3,
    deadLetterTargetArn: { 'Fn::GetAtt': ['InternalErrorQueue', 'Arn'] }
  }

CloudFormation resolves the Fn::GetAtt to a real ARN in AWS, but at
plugin-time the value is still the raw intrinsic object. SQS's
`RedrivePolicy` requires `deadLetterTargetArn` to be a valid ARN
string — passing the JSON blob either fails the CreateQueue call or
produces a queue with malformed DLQ config.

Change
- Added `LocalStackQueueService._containsUnresolvedIntrinsic(value)`:
  recursively detects `Fn::*` and `Ref` keys in CloudFormation Property
  values.
- `_propertiesToAttributes` now checks each attribute's value for
  unresolved intrinsics and DROPS the attribute (with a console.warn)
  instead of stringifying it. The rest of the attributes pass through
  untouched — so `VisibilityTimeout` (the main win of this PR) still
  gets applied locally; the DLQ association is skipped until the
  plugin gains a proper intrinsic resolver.
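The recursive detection could be sketched as follows (assumed shape, not the exact LocalStackQueueService code):

```javascript
// Walk a CloudFormation Property value and report whether any nested
// object still carries an unresolved intrinsic (`Fn::*` or `Ref`).
function containsUnresolvedIntrinsic(value) {
    if (Array.isArray(value)) return value.some(containsUnresolvedIntrinsic);
    if (value && typeof value === 'object') {
        return Object.entries(value).some(
            ([key, nested]) =>
                key === 'Ref' ||
                key.startsWith('Fn::') ||
                containsUnresolvedIntrinsic(nested)
        );
    }
    return false; // strings/numbers (e.g. a resolved ARN) pass through
}
```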

Tests
- 4 new unit tests covering the intrinsic-filter behavior, `Ref`
  handling, nested intrinsic detection, and the happy path where a
  resolved ARN string passes through untouched.
- 43/43 plugin tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lags

`UpdateProcessMetrics` and `UpdateProcessState` both did read-modify-
write against `Process.context` / `Process.results` JSON blobs. Under
concurrent queue-worker invocations (any integration with
`reservedConcurrency > 1`) the spread-and-save pattern clobbers later
writes with stale snapshots: counters drift low and context flags such
as `fetchDone` silently revert to undefined, leaving syncs stuck in
non-terminal states.

Adds `applyProcessUpdate(processId, ops)` to the ProcessRepository
interface and implements it natively in each adapter:

  - PostgreSQL: one UPDATE ... RETURNING * with nested `jsonb_set`
    expressions for increments, sets, and bounded-array pushes. Row
    locking inside the single statement serializes concurrent callers.
  - MongoDB: findAndModify via `$runCommandRaw` with `$inc` / `$set`
    / `$push` + `$slice`. Server-side atomic.
  - DocumentDB: same operator set as Mongo; adds compat note for
    `$slice` (requires DocumentDB 4.0+).

Validation (regex-checked dot-paths rooted at `context|results`,
whitelisted op kinds) lives in a shared `process-update-ops-shared`
module so every adapter fails fast before touching the DB.
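A sketch of that shared validation, assuming an op shape of `{ kind, path, value }` (illustrative; the real `process-update-ops-shared` module may differ):

```javascript
// Whitelisted op kinds and regex-checked dot-paths rooted at context|results.
const OP_KINDS = new Set(['increment', 'set', 'pushSlice']);
const PATH_RE = /^(context|results)(\.[A-Za-z_][A-Za-z0-9_]*)+$/;

function validateOps(ops) {
    for (const op of ops) {
        if (!OP_KINDS.has(op.kind)) {
            throw new Error(`unknown op kind: ${op.kind}`);
        }
        if (!PATH_RE.test(op.path)) {
            throw new Error(`invalid path: ${op.path}`);
        }
    }
    return ops; // fail fast before any adapter touches the DB
}
```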

Refactors `UpdateProcessMetrics` to split into an atomic phase
(counters + bounded error history) and a best-effort non-atomic
follow-up for derived fields (duration, recordsPerSecond,
estimatedCompletion) so existing consumers of those fields see no
contract change. Derived-field write failures are logged and do NOT
fail the overall call — the atomic counters already landed in phase 1.
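The two-phase split can be sketched as follows (repo methods and the `deriveFields` callback are stand-ins, not the actual use-case code):

```javascript
// Phase 1 is atomic and authoritative; phase 2 is best-effort.
async function updateProcessMetricsTwoPhase(repo, processId, counterOps, deriveFields) {
    // Phase 1: atomic counters + bounded error history. Failures here
    // propagate; the counters are the contract.
    const updated = await repo.applyProcessUpdate(processId, counterOps);
    // Phase 2: derived fields (duration, recordsPerSecond, …). Failures
    // are logged and do NOT fail the call — phase 1 already landed.
    try {
        await repo.update(processId, deriveFields(updated));
    } catch (err) {
        console.warn('derived-field update failed (non-fatal):', err.message);
    }
    return updated;
}
```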

Refactors `UpdateProcessState` to route context updates through
`applyProcessUpdate` when the repo supports it, falling back to the
original read-merge-write path for custom repos that don't. Existing
"shallow top-level merge" semantics are preserved.

Removes the long-standing TODO banner on `update-process-metrics.js`
documenting this exact race. Adds 57 unit tests covering validation,
op shapes, race simulation, empty-batch short-circuit, derived-field
non-fatal failures, and the legacy fallback in `UpdateProcessState`.

This commit is additive: `repository.update(id, patch)` is unchanged,
no schema migration, no behavior change for callers that don't opt in
to `applyProcessUpdate`.

Note on branch scoping: this change lands on `claude/plugin-queue-
attributes` per downstream consumer request. The branch now mixes a
serverless-plugin fix (earlier commits) with a core fix; consumers
reviewing this PR should treat the two concerns as separable for
bisection purposes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d-klotz and others added 6 commits April 21, 2026 15:38
Addresses two correctness issues flagged on PR #580 review of
c0487bc:

B1 (Postgres pushSlice ordering): `jsonb_agg(elem)` without an
explicit `ORDER BY` clause produces implementation-defined order even
over a `WITH ORDINALITY` subquery. Added `ORDER BY idx` inside the
aggregate so insertion order is preserved deterministically across
Postgres versions. While there: wrapped the combined-array expression
in a CTE so `${newArr}` is evaluated once on the server rather than
reconstructed three times in the outer expression.

B2 (intermediate-path silent no-op): `jsonb_set(target, path, value,
create_missing=true)` only creates the LEAF segment if missing —
intermediate segments that don't exist as objects cause the call to
return `target` unchanged. For depth-1 paths (our current only callers)
this was fine, but the interface JSDoc advertises arbitrary depth
(`context.pagination.cursor` is in the example). Added an
`ensureParents` helper that wraps each intermediate prefix path in a
`jsonb_set(..., COALESCE(cur #> path, '{}'::jsonb), true)` call,
preserving existing content and synthesizing `{}` where missing.

Known artifact documented in the generated-SQL test: the string-
substitution nature of `ensureParents` causes SQL size to grow as
O(2^N) in path depth. Fine for realistic depths (≤3 segments); a
future optimization could materialize `cur` once per iteration via
CTE to make it O(N). No runtime correctness impact — Postgres still
executes the whole expression as a single UPDATE under a single row
lock.

Adds `process-repository-postgres.test.js` with 10 SQL-generation
tests covering:
  - state-only UPDATE (no jsonb_set chains)
  - depth-1 increment (single jsonb_set wrap, no parent synthesis)
  - depth-2 set (1 parent synthesis + leaf)
  - depth-3 set (ensureParents substitution artifact documented)
  - pushSlice emits `ORDER BY idx` and binds values exactly once
  - parameter binding order is stable left-to-right
  - combined increment/set/pushSlice/newState in one UPDATE
  - null return when UPDATE hits zero rows
  - validateOps rejects invalid paths before DB call
  - depth-1 set skips parent synthesis (no wasted jsonb_set)

Test coverage gap remaining (follow-up): no real-Postgres race test —
the Frigg monorepo has MongoDB-memory-server for Mongo, but no
equivalent for Postgres. Adding one requires a team decision on pg-mem
vs testcontainers vs CI-provisioned. The SQL-generation tests here +
Postgres's documented single-UPDATE atomicity provide the correctness
argument without the infrastructure cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Postgres schema at `packages/core/prisma-postgresql/schema.prisma`
declares two artifacts that no migration in `prisma-postgresql/migrations/`
ever creates. `prisma migrate deploy` runs to completion (there's
nothing to apply for those declarations) and the generated client is
happy, so the drift only surfaces as P2021 / P2022 when a downstream
app actually writes to the affected paths.

Missing pieces:

1. `Entity.data Json @default("{}")` was added to the schema but
   never migrated. ModuleRepositoryPostgres.createEntity / findEntity
   persist and rehydrate any `identifiers`/`details` fields that fall
   outside the six named columns through this JSONB blob — so the
   moment an integration's `getEntityDetails` returns a field like
   `firm_subdomain`, prisma.entity.create throws
   `Unknown argument 'userId'. Did you mean 'user'?` (the first
   argument Prisma can't place is the one it names in the error —
   confusingly unrelated to the real cause). Added
   `20260422120000_add_entity_data_column` — a single
   `ALTER TABLE "Entity" ADD COLUMN IF NOT EXISTS "data" JSONB NOT
   NULL DEFAULT '{}'` so the migration is safe to re-apply on
   environments that may have worked around the drift locally.

2. The entire `Process` model is declared in the schema but no
   migration creates the table, its six indexes, or the three FK
   constraints (user, integration, self-referential parent).
   ProcessRepositoryPostgres and FriggProcessManager depend on this
   for long-running-job state tracking — fan-out sync progress,
   batched state machines — so any app that uses the new process
   primitive hits P2021 on first `prisma.process.create`. Added
   `20260422120001_create_process_table` with the table definition,
   all declared indexes, and the three FK constraints matching the
   relations in schema.prisma (userId→User CASCADE,
   integrationId→Integration CASCADE, parentProcessId→Process
   SET NULL).

Downstream apps currently working around these with local patch
migrations (e.g. lefthookhq/clockwork-recruiting--frigg's
`scripts/patch-core-migrations.js`) can delete that workaround after
bumping to the release that includes this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under heavy concurrent fan-out sync load, worker Lambdas were hanging the
full 900s timeout with no logs or errors. Root cause: three compounding
infrastructure defaults.

1. Prisma engineType="binary" — schema.prisma forced the binary engine,
   which forks a separate Rust query-engine child and communicates over
   an HTTP/IPC pipe that has NO client-side timeout on the Node side. A
   zombied child wedges the calling process until Lambda kills it.
   Switch both postgres and mongodb schemas to engineType="library" (the
   Prisma 3.x+ default), which loads the Rust engine as a Node-API addon
   in-process. Native crashes now surface as Runtime.ExitError instead of
   15-min hangs.

2. Aurora Serverless v2 MaxCapacity=1 ACU — at 0.5–1 ACU Aurora is CPU-
   bound under 20-way concurrent writes, inflating query latency and
   compounding pool pressure. Bump the default to 4 ACU; apps can still
   override via dbConfig.maxCapacity. Scales to MinCapacity when idle so
   cost impact is minimal.

3. DATABASE_URL had no pool or timeout params — a blocked query or row
   lock would wait forever. Append Prisma + libpq params at URL build
   time:
     connection_limit=1 (one pg conn per Lambda container; rely on Lambda
                         concurrency for parallelism — AWS best practice)
     pool_timeout=10    (throw P2024 after 10s instead of hanging)
     connect_timeout=10 (bound TCP/TLS handshake)
     socket_timeout=60  (kill dead sockets when server never responds)
     options=-c statement_timeout=30000 -c lock_timeout=10000
                        (Postgres-side 30s query cap / 10s lock-wait cap
                         — aborts with SQLSTATE 57014 / 55P03 rather than
                         silently sitting in pg_stat_activity)

Any combination of these defaults could mask the other two; all three
had to ship together to make workers fail loud and fast instead of
silently hanging.
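Assembling the URL with these params might look like this (param values copied from this commit message; the helper name is invented):

```javascript
// Prisma + libpq params appended at DATABASE_URL build time. The
// statement_timeout/lock_timeout pair rides in the libpq `options` param.
const LAMBDA_DATABASE_URL_QUERY_PARAMS = [
    'connection_limit=1',
    'pool_timeout=10',
    'connect_timeout=10',
    'socket_timeout=60',
    `options=${encodeURIComponent('-c statement_timeout=30000 -c lock_timeout=10000')}`,
].join('&');

function buildDatabaseUrl(baseUrl) {
    const sep = baseUrl.includes('?') ? '&' : '?';
    return `${baseUrl}${sep}${LAMBDA_DATABASE_URL_QUERY_PARAMS}`;
}
```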

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to f1cb41c, addressing code-review concerns:

1. connection_limit 1 → 2, pool_timeout 10 → 20

   core has multiple in-handler Promise.all patterns against Prisma:
   get-process.executeMany (parallel fetch), field-encryption-service
   (batched decrypt). With connection_limit=1 these would serialize
   behind a single pool slot, and under Aurora pressure each later
   query gets less remaining budget before P2024. Bumping to 2 removes
   the cliff while staying safe against max_connections (200 Lambdas ×
   2 = 400 conns, fits aurora-pg 15 at 4 ACU). pool_timeout=20 still
   fails fast relative to Lambda's 900s cap but gives fan-outs headroom.

2. Cover the use-existing and discover-no-secret runtime-construction paths

   Those paths export DATABASE_HOST / DATABASE_PORT / DATABASE_NAME /
   DATABASE_USER / DATABASE_PASSWORD and expect consumer code to
   assemble DATABASE_URL at runtime. Previously got NONE of the new
   timeouts — a real gap. Extract LAMBDA_DATABASE_URL_QUERY_PARAMS to
   a module-scoped constant and export it as a `DATABASE_URL_PARAMS`
   env var on those two paths, with a log hint telling consumers to
   append `?${DATABASE_URL_PARAMS}` when building the URL. Single
   source of truth for the params; no drift between paths.

3. Tests now assert all six param components (connection_limit,
   pool_timeout, connect_timeout, socket_timeout, statement_timeout,
   lock_timeout) so accidental removal of any is caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the GET /api/authorize handler dropped req.query.state on the
floor: it called getModuleInstanceFromType.execute(userId, entityType)
without forwarding any extra params, so the Module constructor's apiParams
never received state. As a result, the OAuth2Requester base — which
already accepts `state` via get(params, 'state', null) — always saw
state = null, and downstream API modules that hardcode &state=${this.state}
in their authorize URL (e.g. api-module-hubspot) interpolated state=null.

This commit threads state through:
- integration-router.js: GET /api/authorize passes
  { state: req.query.state } as the third arg.
- get-module-instance-from-type.js: execute(userId, type, options = {})
  forwards options.state to the Module constructor.
- module.js: Module constructor accepts state and merges { state } into
  apiParams when truthy, so OAuth2Requester picks it up.

Backwards-compatible: state defaults to null, no callers need to change.

Modules that conditionally append state (e.g. api-module-clockwork-recruiting
which uses `if (this.state) params.append('state', ...)`) continue to omit
the param when no state is provided.
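The constructor-side merge can be sketched as follows (the real Module constructor takes more parameters; this is illustrative):

```javascript
// Merge `state` into apiParams only when truthy, so OAuth2Requester
// picks it up and modules that conditionally append state keep
// omitting the param when none is provided.
class Module {
    constructor({ state = null, ...apiParams } = {}) {
        this.apiParams = { ...apiParams };
        if (state) {
            this.apiParams.state = state;
        }
    }
}
```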

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…butes

fix(serverless-plugin): apply CloudFormation queue Properties to LocalStack
@d-klotz d-klotz merged commit 459050a into next Apr 27, 2026
5 of 6 checks passed
@d-klotz d-klotz deleted the claude/core-requester-timeout branch April 27, 2026 21:22
@seanspeaks
Contributor

🚀 PR was released in v2.0.0-next.81 🚀

@seanspeaks seanspeaks added the prerelease This change is available in a prerelease. label Apr 27, 2026
