fix(core): add per-request timeout to Requester to catch silent fetch hangs #579
Conversation
fix(core): add per-request timeout to Requester to catch silent fetch hangs

Closes the infinite-wait hazard in Requester._request. The framework previously called `fetch()` with no AbortController and no timeout — a silently-hung TCP connection (server accepts but never responds, keeps the socket alive) made the returned promise never resolve. At batch scale (Promise.allSettled over N records) a single hung request stalled the whole batch, which stalled the sync direction, which stalled Promise.all in any bidirectional orchestration, and ultimately ate the full worker-lambda budget.

This reproduced in production against Clockwork Recruiting — individual POST /people/:id/web_sites and /email_addresses calls hung indefinitely while other requests responded normally. 33 such hangs surfaced in a single sync run once the workaround was in place.

Change
- New constructor parameter `requestTimeoutMs` with precedence: instance param → class-level static → 60_000ms default. Pass 0 / null to disable.
- `_request` wires a fresh AbortController per attempt, passes its signal to fetch, clears the timer in finally. Fresh controller per attempt so the existing retry recursion (with its own backoff sleeps) always gets a clean signal.
- On AbortError: throw a FetchError with `isTimeout: true` and `timeoutMs: <N>` properties so callers can distinguish timeouts from generic network errors without parsing the sanitized error message.
- No retry on timeout: a slow endpoint is a downstream problem and each retry would wait another `timeoutMs` before giving up, amplifying the hang into a per-record multi-minute stall at batch scale. The ECONNRESET retry path is unchanged.

Tests
- 14 new tests covering config precedence, abort behavior, no-retry semantics, the disabled-timeout path, signal wiring, and a regression guard on the existing ECONNRESET retry. Uses jest fake timers so the suite runs in sub-second real time.

Adopter impact
- The default of 60s is generous (most OAuth refresh + API calls fit comfortably). Integrations that know they want tighter can opt in via `static requestTimeoutMs = 30_000` on the Api subclass.
- The clockwork-recruiting--frigg app can now delete its adapter-level timeout workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
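A minimal sketch of the wiring this commit describes, assuming a plain node-fetch v2 call site; names like `requestWithTimeout` are illustrative and not the actual Requester internals:

```javascript
const DEFAULT_REQUEST_TIMEOUT_MS = 60_000;

async function requestWithTimeout(fetchImpl, url, options, timeoutMs = DEFAULT_REQUEST_TIMEOUT_MS) {
    if (!timeoutMs) return fetchImpl(url, options); // 0 / null disables the timeout

    const controller = new AbortController();       // fresh controller per attempt
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
        return await fetchImpl(url, { ...options, signal: controller.signal });
    } catch (error) {
        if (error.name === 'AbortError') {
            const timeoutError = new Error(`Request timed out after ${timeoutMs}ms`);
            timeoutError.isTimeout = true;           // lets callers branch without parsing messages
            timeoutError.timeoutMs = timeoutMs;
            throw timeoutError;                      // not retried: see "No retry on timeout"
        }
        throw error;                                 // ECONNRESET etc. keep their retry path
    } finally {
        clearTimeout(timer);
    }
}
```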
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4f65577ef6
Addresses PR #579 review comment r3116786035 (Codex bot).

node-fetch v2 resolves the fetch() promise when response headers arrive, not when the body is fully read. The original patch cleared the abort timer in a finally block immediately after `await this.fetch(...)` returned — so parsedBody(response) and FetchError.create()'s response.text() call were both unprotected from a server that sent headers and then stalled the body stream.

Scoped the timer's lifetime to the whole _request attempt:
- Extracted a clearRequestTimer() helper with idempotent semantics.
- Wrap the full method body in try / catch / finally. The finally now clears the timer regardless of whether the fetch, the status check, the body read, or the FetchError construction completes.
- Before each recursive retry (ECONNRESET, 429/5xx, 401 refresh), clear the current attempt's timer explicitly so the new attempt starts with its own fresh timer.
- Added _maybeFlagTimeoutDuringBodyRead() to surface the isTimeout / timeoutMs fields even when the AbortError fires mid-body-read (the body stream in node-fetch v2 rejects with AbortError on the 'error' event when the signal fires). Callers can keep using the same flag for both header-phase and body-phase timeouts.

Tests: added a 15th case that sends headers, then stalls inside json()/text(), and verifies the timeout still catches it with isTimeout=true / timeoutMs=100. Full suite still runs in sub-second real time on fake timers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
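A sketch of the wider timer scope, under the same assumptions as the earlier sketch (simplified helper names, not the real `_request` body):

```javascript
async function attemptRequest(fetchImpl, url, options, timeoutMs) {
    const controller = new AbortController();
    let timer = setTimeout(() => controller.abort(), timeoutMs);
    const clearRequestTimer = () => {          // idempotent: safe to call more than once
        if (timer) clearTimeout(timer);
        timer = null;
    };
    try {
        const response = await fetchImpl(url, { ...options, signal: controller.signal });
        // node-fetch v2 resolves on headers; json()/text() can still hang on a stalled body,
        // so the timer stays armed through the body read.
        const body = await response.json();
        return { response, body };
    } catch (error) {
        if (error.name === 'AbortError') {
            error.isTimeout = true;            // same flag for header- and body-phase timeouts
            error.timeoutMs = timeoutMs;
        }
        throw error;
    } finally {
        clearRequestTimer();                   // clears whether the fetch, status check, or body read failed
    }
}
```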
Code review: Found 1 issue (same as r3116786035 from the Codex reviewer, now fixed in commit 312fe8b):
frigg/packages/core/modules/requester/requester.js, lines 90 to 125 in 4f65577: fixed by scoping the abort timer to the entire _request attempt. Other observations were checked and ruled out as false positives.
SonarQube Cloud flagged three style issues on PR #579:

- javascript:S1186 (MAJOR) × 2: `hangingFetch` and `okFetch` were declared inside the `describe('AbortController timeout behavior')` block; hoist them to module scope so each describe can reuse them without the nested-function cost.
- javascript:S3799 (MINOR): `okFetch(body = { ok: true })` uses an object literal as a default parameter. Replace with an `undefined`-check that picks up the default inside the body, avoiding the fresh-object-per-call allocation pattern the rule guards against.

No functional change; all 15 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
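A small sketch of the default-parameter change; the helper body is illustrative, not the actual test fixture:

```javascript
// Before: function okFetch(body = { ok: true }) { ... }
function okFetch(body) {
    const responseBody = body === undefined ? { ok: true } : body; // default resolved inside the body
    return Promise.resolve({
        ok: true,
        status: 200,
        json: async () => responseBody,
        text: async () => JSON.stringify(responseBody),
    });
}
```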
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2d1eba87eb
```js
instanceTimeout !== undefined && instanceTimeout !== null
    ? instanceTimeout
    : this.constructor.requestTimeoutMs ??
      DEFAULT_REQUEST_TIMEOUT_MS;
```
Treat null timeout override as disabled

The constructor currently treats `requestTimeoutMs: null` as “use class/default timeout” because null is excluded by `instanceTimeout !== undefined && instanceTimeout !== null`. That contradicts the new in-code API note saying null disables timeouts, so integrations passing null from config to opt out will still time out unexpectedly. Please align the condition with the documented behavior (or update docs/tests if fallback is intended).
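A minimal sketch of one way to align the condition with the documented "null disables" behavior (whether disable or fallback is intended is the maintainers' call; variable names follow the snippet above):

```javascript
const resolvedTimeoutMs =
    instanceTimeout !== undefined
        ? instanceTimeout // 0 and null both mean "timeout disabled", per the API note
        : this.constructor.requestTimeoutMs ?? DEFAULT_REQUEST_TIMEOUT_MS;
```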
```js
// If the abort fired during body consumption, node-fetch emits
// the error as an AbortError on the body stream. Surface the
// same isTimeout flag callers use for header-phase timeouts.
throw this._maybeFlagTimeoutDuringBodyRead(e, timeoutMs);
```
Wrap body-read timeout aborts as FetchError
If a 4xx/5xx response stalls while FetchError.create is reading response.text(), the abort surfaces as an AbortError and the outer catch rethrows it directly after tagging timeout fields. That means timeout failures in this path no longer have the FetchError shape/status context that other requester failures expose, which can break callers relying on FetchError handling. Consider converting this abort into a FetchError consistently.
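A sketch of one way to keep the FetchError shape for body-phase timeouts; the constructor arguments shown are an assumption, not the framework's verified signature:

```javascript
// FetchError is the framework's existing error class (shape assumed for illustration).
function toTimeoutFetchError(error, timeoutMs, response) {
    if (error.name !== 'AbortError') return error; // non-timeout failures pass through untouched
    const wrapped = new FetchError(`Request timed out after ${timeoutMs}ms`);
    wrapped.isTimeout = true;
    wrapped.timeoutMs = timeoutMs;
    if (response) wrapped.status = response.status; // preserve status context when headers arrived
    return wrapped;
}
```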
fix(serverless-plugin): apply CloudFormation queue Properties to LocalStack
LocalStack queues were being created with only `QueueName`, dropping
every CloudFormation `Properties` field. AWS (via CloudFormation)
applies the full Properties block — notably `VisibilityTimeout: 1800`
that integration-builder emits for integration queues — so local
emulation diverges from production.
Observed impact: the HubspotQueue queue worker's handler runs for
minutes (initial sync of a real firm), but SQS's default 30 s
VisibilityTimeout makes the message re-appear as visible every 30 s.
The same message gets delivered to a second worker invocation while
the first is still processing, which breaks the "one worker per
message" invariant Frigg assumes.
Change
- `extractQueueDefinitions` now looks up each queue's matching
`AWS::SQS::Queue` resource in `serverless.service.resources.Resources`
and attaches its `Properties` to the queue definition. If the
resource is missing, non-SQS, or has no Properties, the behavior is
unchanged (back-compat).
- `LocalStackQueueService.createQueue` accepts an optional
`attributes` argument and forwards it to SQS `CreateQueue`. When
`attributes` is missing/empty the call shape matches the previous
`{QueueName}`-only behavior.
- `LocalStackQueueService._propertiesToAttributes` whitelists the 15
CloudFormation Properties that map 1:1 onto SQS CreateQueue
Attributes and drops everything else (Tags, QueueName, etc.). Object
values like `RedrivePolicy` get JSON-encoded as AWS expects.
- `createQueues` wires the two together.
Tests: 39/39 pass (9 new cases covering the Properties-forwarding path,
back-compat for definitions without resources, and the property-to-
attribute mapping).
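A minimal sketch of the Properties-to-attributes mapping described above; the whitelist shown is a short illustrative subset rather than the plugin's full 15-entry list:

```javascript
const SQS_ATTRIBUTE_WHITELIST = new Set([
    'VisibilityTimeout',
    'MessageRetentionPeriod',
    'DelaySeconds',
    'RedrivePolicy',
]);

function propertiesToAttributes(properties = {}) {
    const attributes = {};
    for (const [key, value] of Object.entries(properties)) {
        if (!SQS_ATTRIBUTE_WHITELIST.has(key)) continue; // drops Tags, QueueName, etc.
        attributes[key] =
            value !== null && typeof value === 'object'
                ? JSON.stringify(value)   // e.g. RedrivePolicy objects get JSON-encoded
                : String(value);          // e.g. { VisibilityTimeout: 1800 } → '1800'
    }
    return attributes;
}
```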
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rom queue attributes

Addresses Codex review P1 on PR #580. Serializing every CloudFormation `Properties` value with `JSON.stringify` would forward unresolved CloudFormation intrinsics (`Fn::GetAtt`, `Ref`, `Fn::Sub`, …) into LocalStack's `CreateQueue` call. The concrete case: integration-builder.js emits

```js
RedrivePolicy: {
    maxReceiveCount: 3,
    deadLetterTargetArn: { 'Fn::GetAtt': ['InternalErrorQueue', 'Arn'] }
}
```

CloudFormation resolves the Fn::GetAtt to a real ARN in AWS, but at plugin-time the value is still the raw intrinsic object. SQS's `RedrivePolicy` requires `deadLetterTargetArn` to be a valid ARN string — passing the JSON blob either fails the CreateQueue call or produces a queue with malformed DLQ config.

Change
- Added `LocalStackQueueService._containsUnresolvedIntrinsic(value)`: recursively detects `Fn::*` and `Ref` keys in CloudFormation Property values.
- `_propertiesToAttributes` now checks each attribute's value for unresolved intrinsics and DROPS the attribute (with a console.warn) instead of stringifying it. The rest of the attributes pass through untouched — so `VisibilityTimeout` (the main win of this PR) still gets applied locally; the DLQ association is skipped until the plugin gains a proper intrinsic resolver.

Tests
- 4 new unit tests covering the intrinsic-filter behavior, `Ref` handling, nested intrinsic detection, and the happy path where a resolved ARN string passes through untouched.
- 43/43 plugin tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
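A sketch of the recursive intrinsic check this commit describes; on the real class it is a `LocalStackQueueService` method, the standalone function below is just for illustration:

```javascript
// Any object keyed by `Ref` or an `Fn::*` intrinsic (at any depth) is still unresolved.
function containsUnresolvedIntrinsic(value) {
    if (Array.isArray(value)) return value.some(containsUnresolvedIntrinsic);
    if (value === null || typeof value !== 'object') return false;
    return Object.entries(value).some(
        ([key, nested]) =>
            key === 'Ref' ||
            key.startsWith('Fn::') ||
            containsUnresolvedIntrinsic(nested)
    );
}

// containsUnresolvedIntrinsic({ deadLetterTargetArn: { 'Fn::GetAtt': ['Q', 'Arn'] } }) === true
// containsUnresolvedIntrinsic({ deadLetterTargetArn: 'arn:aws:sqs:us-east-1:123:dlq' }) === false
```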
…lags
`UpdateProcessMetrics` and `UpdateProcessState` both did read-modify-
write against `Process.context` / `Process.results` JSON blobs. Under
concurrent queue-worker invocations (any integration with
`reservedConcurrency > 1`) the spread-and-save pattern clobbers later
writes with stale snapshots: counters drift low and context flags such
as `fetchDone` silently revert to undefined, leaving syncs stuck in
non-terminal states.
Adds `applyProcessUpdate(processId, ops)` to the ProcessRepository
interface and implements it natively in each adapter:
- PostgreSQL: one UPDATE ... RETURNING * with nested `jsonb_set`
expressions for increments, sets, and bounded-array pushes. Row
locking inside the single statement serializes concurrent callers.
- MongoDB: findAndModify via `$runCommandRaw` with `$inc` / `$set`
/ `$push` + `$slice`. Server-side atomic.
- DocumentDB: same operator set as Mongo; adds compat note for
`$slice` (requires DocumentDB 4.0+).
Validation (regex-checked dot-paths rooted at `context|results`,
whitelisted op kinds) lives in a shared `process-update-ops-shared`
module so every adapter fails fast before touching the DB.
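An illustrative ops payload and the MongoDB operators it might translate to; the op names and shapes below are assumptions based on the description above, not the actual `applyProcessUpdate` contract:

```javascript
const ops = [
    { kind: 'increment', path: 'results.recordsProcessed', by: 25 },
    { kind: 'set', path: 'context.fetchDone', value: true },
    { kind: 'pushSlice', path: 'results.errors', value: { msg: 'timeout' }, maxLength: 50 },
];

// MongoDB adapter sketch: all three land in a single atomic findAndModify update document.
const mongoUpdate = {
    $inc: { 'results.recordsProcessed': 25 },
    $set: { 'context.fetchDone': true },
    // $each + $slice keeps only the most recent 50 entries (bounded error history).
    $push: { 'results.errors': { $each: [{ msg: 'timeout' }], $slice: -50 } },
};
```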
Refactors `UpdateProcessMetrics` to split into an atomic phase
(counters + bounded error history) and a best-effort non-atomic
follow-up for derived fields (duration, recordsPerSecond,
estimatedCompletion) so existing consumers of those fields see no
contract change. Derived-field write failures are logged and do NOT
fail the overall call — the atomic counters already landed in phase 1.
Refactors `UpdateProcessState` to route context updates through
`applyProcessUpdate` when the repo supports it, falling back to the
original read-merge-write path for custom repos that don't. Existing
"shallow top-level merge" semantics are preserved.
Removes the long-standing TODO banner on `update-process-metrics.js`
documenting this exact race. Adds 57 unit tests covering validation,
op shapes, race simulation, empty-batch short-circuit, derived-field
non-fatal failures, and the legacy fallback in `UpdateProcessState`.
This commit is additive: `repository.update(id, patch)` is unchanged,
no schema migration, no behavior change for callers that don't opt in
to `applyProcessUpdate`.
Note on branch scoping: this change lands on `claude/plugin-queue-
attributes` per downstream consumer request. The branch now mixes a
serverless-plugin fix (earlier commits) with a core fix; consumers
reviewing this PR should treat the two concerns as separable for
bisection purposes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses two correctness issues flagged on PR #580 review of c0487bc:

B1 (Postgres pushSlice ordering): `jsonb_agg(elem)` without an explicit `ORDER BY` clause produces implementation-defined order even over a `WITH ORDINALITY` subquery. Added `ORDER BY idx` inside the aggregate so insertion order is preserved deterministically across Postgres versions. While there: wrapped the combined-array expression in a CTE so `${newArr}` is evaluated once on the server rather than reconstructed three times in the outer expression.

B2 (intermediate-path silent no-op): `jsonb_set(target, path, value, create_missing=true)` only creates the LEAF segment if missing — intermediate segments that don't exist as objects cause the call to return `target` unchanged. For depth-1 paths (our current only callers) this was fine, but the interface JSDoc advertises arbitrary depth (`context.pagination.cursor` is in the example). Added an `ensureParents` helper that wraps each intermediate prefix path in a `jsonb_set(..., COALESCE(cur #> path, '{}'::jsonb), true)` call, preserving existing content and synthesizing `{}` where missing.

Known artifact documented in the generated-SQL test: the string-substitution nature of `ensureParents` causes SQL size to grow as O(2^N) in path depth. Fine for realistic depths (≤3 segments); a future optimization could materialize `cur` once per iteration via CTE to make it O(N). No runtime correctness impact — Postgres still executes the whole expression as a single UPDATE under a single row lock.

Adds `process-repository-postgres.test.js` with 10 SQL-generation tests covering:
- state-only UPDATE (no jsonb_set chains)
- depth-1 increment (single jsonb_set wrap, no parent synthesis)
- depth-2 set (1 parent synthesis + leaf)
- depth-3 set (ensureParents substitution artifact documented)
- pushSlice emits `ORDER BY idx` and binds values exactly once
- parameter binding order is stable left-to-right
- combined increment/set/pushSlice/newState in one UPDATE
- null return when UPDATE hits zero rows
- validateOps rejects invalid paths before DB call
- depth-1 set skips parent synthesis (no wasted jsonb_set)

Test coverage gap remaining (follow-up): no real-Postgres race test — the Frigg monorepo has MongoDB-memory-server for Mongo, but no equivalent for Postgres. Adding one requires a team decision on pg-mem vs testcontainers vs CI-provisioned. The SQL-generation tests here + Postgres's documented single-UPDATE atomicity provide the correctness argument without the infrastructure cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
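For reference, the B1 ordering fix amounts to aggregating over the ordinality index; a hedged sketch of what such a generated fragment could look like (not the verbatim SQL the repository emits):

```javascript
// Rebuild a jsonb array in deterministic insertion order: jsonb_agg alone gives
// implementation-defined order, so the aggregate orders by the WITH ORDINALITY index.
const orderedArrayExpr = (src) => `(
    SELECT COALESCE(jsonb_agg(elem ORDER BY idx), '[]'::jsonb)
    FROM jsonb_array_elements(${src}) WITH ORDINALITY AS t(elem, idx)
)`;
```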
The Postgres schema at `packages/core/prisma-postgresql/schema.prisma` declares two artifacts that no migration in `prisma-postgresql/migrations/` ever creates. `prisma migrate deploy` runs to completion (there's nothing to apply for those declarations) and the generated client is happy, so the drift only surfaces as P2021 / P2022 when a downstream app actually writes to the affected paths.

Missing pieces:

1. `Entity.data Json @default("{}")` was added to the schema but never migrated. ModuleRepositoryPostgres.createEntity / findEntity persist and rehydrate any `identifiers`/`details` fields that fall outside the six named columns through this JSONB blob — so the moment an integration's `getEntityDetails` returns a field like `firm_subdomain`, prisma.entity.create throws `Unknown argument 'userId'. Did you mean 'user'?` (the first argument Prisma can't place is the one it names in the error — confusingly unrelated to the real cause). Added `20260422120000_add_entity_data_column` — a single `ALTER TABLE "Entity" ADD COLUMN IF NOT EXISTS "data" JSONB NOT NULL DEFAULT '{}'` so the migration is safe to re-apply on environments that may have worked around the drift locally.

2. The entire `Process` model is declared in the schema but no migration creates the table, its six indexes, or the three FK constraints (user, integration, self-referential parent). ProcessRepositoryPostgres and FriggProcessManager depend on this for long-running-job state tracking — fan-out sync progress, batched state machines — so any app that uses the new process primitive hits P2021 on first `prisma.process.create`. Added `20260422120001_create_process_table` with the table definition, all declared indexes, and the three FK constraints matching the relations in schema.prisma (userId→User CASCADE, integrationId→Integration CASCADE, parentProcessId→Process SET NULL).

Downstream apps currently working around these with local patch migrations (e.g. lefthookhq/clockwork-recruiting--frigg's `scripts/patch-core-migrations.js`) can delete that workaround after bumping to the release that includes this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under heavy concurrent fan-out sync load, worker Lambdas were hanging the
full 900s timeout with no logs or errors. Root cause: three compounding
infrastructure defaults.
1. Prisma engineType="binary" — schema.prisma forced the binary engine,
which forks a separate Rust query-engine child and communicates over
an HTTP/IPC pipe that has NO client-side timeout on the Node side. A
zombied child wedges the calling process until Lambda kills it.
Switch both postgres and mongodb schemas to engineType="library" (the
Prisma 3.x+ default), which loads the Rust engine as a Node-API addon
in-process. Native crashes now surface as Runtime.ExitError instead of
15-min hangs.
2. Aurora Serverless v2 MaxCapacity=1 ACU — at 0.5–1 ACU Aurora is CPU-
bound under 20-way concurrent writes, inflating query latency and
compounding pool pressure. Bump the default to 4 ACU; apps can still
override via dbConfig.maxCapacity. Scales to MinCapacity when idle so
cost impact is minimal.
3. DATABASE_URL had no pool or timeout params — a blocked query or row
lock would wait forever. Append Prisma + libpq params at URL build
time (see the sketch after this list):
connection_limit=1 (one pg conn per Lambda container; rely on Lambda
concurrency for parallelism — AWS best practice)
pool_timeout=10 (throw P2024 after 10s instead of hanging)
connect_timeout=10 (bound TCP/TLS handshake)
socket_timeout=60 (kill dead sockets when server never responds)
options=-c statement_timeout=30000 -c lock_timeout=10000
(Postgres-side 30s query cap / 10s lock-wait cap
— aborts with SQLSTATE 57014 / 55P03 rather than
silently sitting in pg_stat_activity)
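A sketch of how the assembled URL might look with those params appended; `getDbConfig` is a hypothetical placeholder and the real construction lives in the deployment tooling:

```javascript
const { user, password, host, port, dbName } = getDbConfig(); // hypothetical accessor, for illustration

const LAMBDA_DATABASE_URL_QUERY_PARAMS = [
    'connection_limit=1',
    'pool_timeout=10',
    'connect_timeout=10',
    'socket_timeout=60',
    `options=${encodeURIComponent('-c statement_timeout=30000 -c lock_timeout=10000')}`,
].join('&');

const databaseUrl =
    `postgresql://${user}:${encodeURIComponent(password)}@${host}:${port}/${dbName}` +
    `?${LAMBDA_DATABASE_URL_QUERY_PARAMS}`;
```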
Any combination of these defaults could mask the other two; all three
had to ship together to make workers fail loud and fast instead of
silently hanging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to f1cb41c, addressing code-review concerns:

1. connection_limit 1 → 2, pool_timeout 10 → 20. Core has multiple in-handler Promise.all patterns against Prisma: get-process.executeMany (parallel fetch), field-encryption-service (batched decrypt). With connection_limit=1 these would serialize behind a single pool slot, and under Aurora pressure each later query gets less remaining budget before P2024. Bumping to 2 removes the cliff while staying safe against max_connections (200 Lambdas × 2 = 400 conns, fits aurora-pg 15 at 4 ACU). pool_timeout=20 still fails fast relative to Lambda's 900s cap but gives fan-outs headroom.

2. Cover the use-existing and discover-no-secret runtime-construction paths. Those paths export DATABASE_HOST / DATABASE_PORT / DATABASE_NAME / DATABASE_USER / DATABASE_PASSWORD and expect consumer code to assemble DATABASE_URL at runtime. Previously they got NONE of the new timeouts — a real gap. Extract LAMBDA_DATABASE_URL_QUERY_PARAMS to a module-scoped constant and export it as a `DATABASE_URL_PARAMS` env var on those two paths, with a log hint telling consumers to append `?${DATABASE_URL_PARAMS}` when building the URL. Single source of truth for the params; no drift between paths.

3. Tests now assert every param component (connection_limit, pool_timeout, connect_timeout, socket_timeout, and the statement_timeout / lock_timeout options) so accidental removal of any is caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
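A consumer-side sketch for the use-existing / discover-no-secret paths in point 2; the env var names come from this commit, the assembly itself is illustrative:

```javascript
// Consumer code assembles DATABASE_URL at runtime from the exported pieces
// and appends the shared params so it gets the same timeout behavior.
const {
    DATABASE_HOST, DATABASE_PORT, DATABASE_NAME,
    DATABASE_USER, DATABASE_PASSWORD, DATABASE_URL_PARAMS,
} = process.env;

process.env.DATABASE_URL =
    `postgresql://${DATABASE_USER}:${encodeURIComponent(DATABASE_PASSWORD)}` +
    `@${DATABASE_HOST}:${DATABASE_PORT}/${DATABASE_NAME}?${DATABASE_URL_PARAMS}`;
```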
Previously the GET /api/authorize handler dropped req.query.state on the
floor: it called getModuleInstanceFromType.execute(userId, entityType)
without forwarding any extra params, so the Module constructor's apiParams
never received state. As a result, the OAuth2Requester base — which
already accepts `state` via get(params, 'state', null) — always saw
state = null, and downstream API modules that hardcode &state=${this.state}
in their authorize URL (e.g. api-module-hubspot) interpolated state=null.
This commit threads state through:
- integration-router.js: GET /api/authorize passes
{ state: req.query.state } as the third arg.
- get-module-instance-from-type.js: execute(userId, type, options = {})
forwards options.state to the Module constructor.
- module.js: Module constructor accepts state and merges { state } into
apiParams when truthy, so OAuth2Requester picks it up.
Backwards-compatible: state defaults to null, no callers need to change.
Modules that conditionally append state (e.g. api-module-clockwork-recruiting
which uses `if (this.state) params.append('state', ...)`) continue to omit
the param when no state is provided.
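A simplified sketch of the threading; real signatures in core differ in detail and `resolveModuleClass` is a hypothetical stand-in:

```javascript
// integration-router.js (sketch): GET /api/authorize now forwards req.query.state, e.g.
// await getModuleInstanceFromType.execute(userId, entityType, { state: req.query.state });

// get-module-instance-from-type.js (sketch)
async function execute(userId, type, options = {}) {
    const ModuleClass = resolveModuleClass(type); // hypothetical lookup helper
    return new ModuleClass({ userId, state: options.state ?? null });
}

// module.js constructor excerpt (sketch)
class Module {
    constructor(params = {}) {
        const apiParams = { ...(params.apiParams || {}) };
        if (params.state) apiParams.state = params.state; // merged only when truthy
        this.apiParams = apiParams; // OAuth2Requester picks state up from here
    }
}
```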
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude/plugin-queue-attributes: fix(serverless-plugin): apply CloudFormation queue Properties to LocalStack
🚀 PR was released.
Problem

`Requester._request` calls `this.fetch(url, options)` with no `AbortController` and no timeout. A silently-hung TCP connection (server accepts but never responds, keeps the socket alive) makes the returned promise never resolve. At batch scale the blast radius is catastrophic:

- A single hung request stalls its `Promise.allSettled` batch forever.
- The stalled batch stalls its sync direction, which stalls `Promise.all([inbound, outbound])` in any bidirectional orchestration and eats the full worker-lambda budget.

Reproduced in production against Clockwork Recruiting: individual `POST /people/:id/web_sites` and `/email_addresses` calls hung indefinitely while other requests responded normally. 33 such hangs surfaced in a single sync run once an adapter-level workaround was in place. Every Frigg integration using the framework's requester is exposed to this — not a Clockwork-specific issue.

Fix
New constructor parameter `requestTimeoutMs` with precedence:

1. Instance param: `new Api({ requestTimeoutMs: 30_000 })`
2. Class-level static: `static requestTimeoutMs = 30_000`
3. Default: `60_000ms` (60s)

Pass `0` or `null` to disable (reserved for test doubles and documented long-running endpoints).

Fresh `AbortController` per attempt — the existing retry recursion uses its own backoff sleeps, so each `_request` invocation gets a clean signal. Timer cleared in `finally` regardless of outcome.

On timeout: throw a `FetchError` with `isTimeout: true` and `timeoutMs: <N>` properties so callers can distinguish timeouts from generic network errors without parsing the sanitized error message (FetchError strips body text outside `STAGE=dev`).

No retry on timeout: a slow endpoint is a downstream problem; each retry would wait another `timeoutMs` before giving up, amplifying the hang into a per-record multi-minute stall at batch scale. The `ECONNRESET` retry path is unchanged.

Default choice (60s)

Conservative enough to not regress any integration relying on slow-but-legitimate calls (OAuth refresh to enterprise IdPs, bulk exports). Integrations that want tighter can opt in per-class. Clockwork specifically will use `static requestTimeoutMs = 30_000` post-merge.
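For adopters, opting in (or out) looks roughly like this; `ClockworkApi` is a stand-in class name, not a real module:

```javascript
const { OAuth2Requester } = require('@friggframework/core'); // import path assumed

// Tighter per-class timeout on an Api subclass:
class ClockworkApi extends OAuth2Requester {
    static requestTimeoutMs = 30_000; // overrides the 60s framework default
}

// Per-instance override:
const api = new ClockworkApi({ requestTimeoutMs: 30_000 });

// Disable entirely (test doubles, documented long-running endpoints):
const noTimeoutApi = new ClockworkApi({ requestTimeoutMs: 0 });
```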
Test plan

- New tests covering config precedence, abort behavior, no-retry-on-timeout semantics, the `requestTimeoutMs=0` disabled path, signal wiring, and a regression guard on the existing `ECONNRESET` retry.
- `packages/core/modules/` suite: 74/74 passing (2 pre-existing skips are live-API tests).
- No changes to `parsedBody`, `_get`/`_post`/`_patch`/`_put`/`_delete`, or any public method signature.

Adopter impact

- The default of `60s` is generous. I checked the sibling integrations (hubspot, clockwork, attio, quo) and none have documented long-running calls beyond OAuth refresh, which completes well under 60s.
- The `clockwork-recruiting--frigg` app can delete its adapter-level `withTimeout` workaround — the core fix supersedes it.

🤖 Generated with Claude Code
📦 Published PR as canary version:
`2.0.0--canary.579.2d1eba8.0`

✨ Test out this PR locally via:

```bash
npm install @friggframework/core@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/devtools@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/eslint-config@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/prettier-config@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/schemas@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/serverless-plugin@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/test@2.0.0--canary.579.2d1eba8.0
npm install @friggframework/ui@2.0.0--canary.579.2d1eba8.0

# or

yarn add @friggframework/core@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/devtools@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/eslint-config@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/prettier-config@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/schemas@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/serverless-plugin@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/test@2.0.0--canary.579.2d1eba8.0
yarn add @friggframework/ui@2.0.0--canary.579.2d1eba8.0
```