Skip to content

fix(cache): cache repeated evals by repeat index#8480

Merged
mldangelo merged 11 commits intomainfrom
mdangelo/codex/repeat-cache-namespaces-1431
Apr 10, 2026
Merged

fix(cache): cache repeated evals by repeat index#8480
mldangelo merged 11 commits intomainfrom
mdangelo/codex/repeat-cache-namespaces-1431

Conversation

@mldangelo-oai
Copy link
Copy Markdown
Contributor

Summary

  • keep cache enabled when repeat > 1 and isolate cache entries by repeat index
  • namespace direct cache access, fetch cache keys, in-flight fetch dedupe, bulk cache APIs, and namespace-local clear operations
  • add evaluator/cache regressions and document repeat-aware cache behavior

QA

  • npm run tsc
  • npm test (635 files passed, 15298 tests passed, 22 skipped)
  • npx vitest run test/cache.test.ts test/evaluator.test.ts test/commands/eval/evaluateOptions.test.ts
  • real OpenAI CLI smoke with --repeat 2 and --max-concurrency 2: cold run returned distinct uncached repeat rows; warm rerun replayed the same rows with cached=true
  • adversarial scripts verified namespace isolation for mget/mset/mdel/ttl/clear and concurrent repeat execution

Closes #1431

@mldangelo-oai mldangelo-oai requested review from a team, ianw-oai and zcrab-oai as code owners April 4, 2026 08:14
Copilot AI review requested due to automatic review settings April 4, 2026 08:14
Copy link
Copy Markdown
Contributor

@promptfoo-scanner promptfoo-scanner bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 All Clear

I reviewed the PR changes related to cache namespacing and repeat-aware caching in evaluator and cache utilities. The updates introduce AsyncLocalStorage-backed namespaces and adjust when caching is disabled, but do not change prompt construction, tool capabilities, or execution paths. Based on tracing and analysis, no LLM-security vulnerabilities were identified in this PR.

Minimum severity threshold: 🟡 Medium | To re-scan after changes, comment @promptfoo-scanner
Learn more


Was this helpful?  👍 Yes  |  👎 No 

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enables caching when evaluateOptions.repeat > 1 by scoping cache access to a repeat-specific namespace, so each repeat index gets its own cached entries while preserving provider cache behavior.

Changes:

  • Stop globally disabling cache when repeat > 1 (CLI + programmatic entrypoints).
  • Introduce cache namespacing (AsyncLocalStorage-backed) and apply it to evaluator execution per repeat index.
  • Add regression tests and documentation for repeat-aware caching behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/cache.ts Adds cache namespace support and scopes fetch-cache keys/in-flight dedupe by namespace.
src/evaluator.ts Runs eval and extension hooks inside a repeat-index cache namespace and removes repeat-based cache busting.
src/index.ts Keeps cache enabled when repeat > 1 unless explicitly disabled.
src/commands/eval.ts Keeps cache enabled when repeat > 1 unless explicitly disabled.
test/cache.test.ts Expands cache mocks and adds tests for namespace isolation (direct/bulk/clear/in-flight).
test/evaluator.test.ts Updates repeat behavior expectation and adds repeat-aware cache isolation regressions.
test/evaluator.integration.transforms.test.ts Updates cache mock to include withCacheNamespace.
test/commands/eval/evaluateOptions.test.ts Updates expectation/comment to reflect repeat no longer disabling cache.
site/docs/guides/evaluate-llm-temperature.md Documents repeat-index-specific caching and how to disable it.
site/docs/configuration/caching.md Documents repeat-index cache namespaces and guidance for --no-cache with --repeat.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}),
reset: vi.fn(),
ttl: vi.fn(),
ttl: vi.fn().mockImplementation((key: string) => Promise.resolve(expiresAt.get(key))),
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache-manager mock’s ttl() implementation returns the absolute expiration timestamp (expiresAt.get(key)), but cache-manager’s ttl() API is intended to return the remaining TTL duration (or undefined/null when not set). This mismatch can make TTL-related tests pass while diverging from real behavior; consider returning Math.max(0, expiresAt.get(key) - Date.now()) (and undefined when absent).

Suggested change
ttl: vi.fn().mockImplementation((key: string) => Promise.resolve(expiresAt.get(key))),
ttl: vi.fn().mockImplementation((key: string) => {
const expiry = expiresAt.get(key);
if (expiry === undefined) {
return Promise.resolve(undefined);
}
return Promise.resolve(Math.max(0, expiry - Date.now()));
}),

Copilot uses AI. Check for mistakes.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

Walkthrough

The pull request implements cache namespacing via AsyncLocalStorage to enable caching when evaluateOptions.repeat is greater than 1. Previously, caching was globally disabled when repeat count exceeded 1. Now each repeat index receives a distinct cache namespace (e.g., repeat:0, repeat:1), allowing subsequent identical eval runs to reuse cached responses per repeat while keeping outputs separated across iterations. The cache infrastructure now transparently prefixes keys with namespace identifiers, and a new withCacheNamespace() helper manages namespace scope during async execution. Documentation has been updated to clarify this behavior, and cache-disable logic has been adjusted to only trigger when explicitly disabled, not by repeat count.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Details

The changes introduce new async context management patterns (AsyncLocalStorage) for cache isolation and refactor the evaluator's control flow to apply per-repeat namespacing. While the individual logic pieces are reasonably straightforward, the interconnected nature of cache infrastructure, evaluator semantics, and signature modifications across multiple files requires careful reasoning about namespace propagation, cache key generation, and eval flow integration. The comprehensive test additions support validation but do not reduce the core review complexity.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.53% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: enabling cache for repeated evaluations by using repeat-index-based namespacing.
Description check ✅ Passed The description is directly related to the changeset, detailing cache behavior changes, namespace isolation, and testing performed.
Linked Issues check ✅ Passed The PR fully addresses #1431 by implementing per-repeat-index cache namespacing to enable caching when repeat > 1 and isolate cache entries.
Out of Scope Changes check ✅ Passed All changes are directly scoped to enabling caching for repeated evaluations through cache namespacing and related infrastructure updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mdangelo/codex/repeat-cache-namespaces-1431

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/cache.ts (1)

391-392: Minor: getCacheInstance() is synchronous but awaited.

getCacheInstance() returns Cache directly (not a Promise), but it's being awaited on Line 392. This works but is unnecessary.

♻️ Suggested fix
   const cacheKey = getScopedCacheKey(`fetch:v2:${url}:${JSON.stringify(copy)}`);
-  const cache = await getCacheInstance();
+  const cache = getCacheInstance();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cache.ts` around lines 391 - 392, The code awaits getCacheInstance() even
though it returns a Cache synchronously; remove the unnecessary await so that
cache is assigned directly from getCacheInstance() (update the line referencing
getCacheInstance and any nearby usage of cacheKey/getScopedCacheKey if needed).
Ensure no other call sites rely on it being a Promise and run tests to confirm
behavior.
src/evaluator.ts (2)

1649-1657: Keep the repeat cache boundary in one place.

processEvalStep() already enters the per-repeat namespace, so calling runEval() here re-enters the same scope. Calling runEvalInternal(evalStep) from this path would make the active namespace easier to reason about and avoid future drift if the namespace helper changes.

♻️ Possible simplification
-          const rows = await runEval(evalStep);
+          const rows = await runEvalInternal(evalStep);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluator.ts` around lines 1649 - 1657, processEvalStep currently already
enters the per-repeat cache namespace via
withCacheNamespace(getRepeatCacheNamespace(...)), so inside that block you
should call runEvalInternal(evalStep) instead of runEval(evalStep) to avoid
re-entering the same namespace; update the call in the withCacheNamespace
callback (replace runEval with runEvalInternal) and ensure runEvalInternal is
visible here (import/adjust export if necessary) so the active namespace remains
consistent for evalStep.repeatIndex/evaluateOptions.

1715-1734: Use structured logger context for these new error paths.

Both error logs serialize details into the message string, which skips the logger's structured, auto-sanitized context path. Keep the message static and attach the dynamic fields as context instead.

🪵 Suggested logging shape
-              logger.error(`Error saving result: ${error} ${safeJsonStringify(resultSummary)}`);
+              logger.error('[Evaluator] Error saving result', {
+                error,
+                resultSummary,
+              });

@@
-              logger.error(
-                `Target returned HTTP ${httpStatus}. Aborting scan - this error will not resolve on retry.`,
-              );
+              logger.error('[Evaluator] Target unavailable, aborting scan', {
+                httpStatus,
+              });
As per coding guidelines, "Use the logger with object context (auto-sanitized) for logging: `logger.debug('[Component] Message', { headers, body, config })`".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluator.ts` around lines 1715 - 1734, The error logs are embedding
dynamic details into the message string instead of using structured logger
context; update the two logger.error calls so messages stay static and pass
dynamic fields in the context object: for the catch around
this.evalRecord.addResult(), keep a static message like "Error saving result"
and attach { error, resultSummary: summarizeEvaluateResultForLogging(row) } (or
the precomputed resultSummary) as the second argument to logger.error; for the
HTTP-status branch, keep the static message "Target returned non-transient HTTP
status. Aborting scan." and pass { httpStatus, rowMetadata:
row.response?.metadata } (or similar relevant fields) as structured context;
ensure references to this.evalRecord.addResult,
summarizeEvaluateResultForLogging, logger.error, httpStatus, and
isNonTransientHttpStatus are used to locate where to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/evaluator.test.ts`:
- Around line 4000-4141: The tests "isolates manual provider cache entries by
repeat index" and "isolates beforeEach extension cache entries by repeat index"
mutate shared cache and only call clearCache() at start; add teardown cleanup by
calling clearCache() after each test (e.g., in a finally block at the end of
each it() or by adding an afterEach(async () => await clearCache()) at the top
of the spec) so that the cache is cleared regardless of test outcome; update the
tests referencing clearCache(), evaluate(), Eval.create(), and the mocked
runExtensionHook to ensure clearCache() runs after each test to prevent
cross-test leakage.

---

Nitpick comments:
In `@src/cache.ts`:
- Around line 391-392: The code awaits getCacheInstance() even though it returns
a Cache synchronously; remove the unnecessary await so that cache is assigned
directly from getCacheInstance() (update the line referencing getCacheInstance
and any nearby usage of cacheKey/getScopedCacheKey if needed). Ensure no other
call sites rely on it being a Promise and run tests to confirm behavior.

In `@src/evaluator.ts`:
- Around line 1649-1657: processEvalStep currently already enters the per-repeat
cache namespace via withCacheNamespace(getRepeatCacheNamespace(...)), so inside
that block you should call runEvalInternal(evalStep) instead of
runEval(evalStep) to avoid re-entering the same namespace; update the call in
the withCacheNamespace callback (replace runEval with runEvalInternal) and
ensure runEvalInternal is visible here (import/adjust export if necessary) so
the active namespace remains consistent for
evalStep.repeatIndex/evaluateOptions.
- Around line 1715-1734: The error logs are embedding dynamic details into the
message string instead of using structured logger context; update the two
logger.error calls so messages stay static and pass dynamic fields in the
context object: for the catch around this.evalRecord.addResult(), keep a static
message like "Error saving result" and attach { error, resultSummary:
summarizeEvaluateResultForLogging(row) } (or the precomputed resultSummary) as
the second argument to logger.error; for the HTTP-status branch, keep the static
message "Target returned non-transient HTTP status. Aborting scan." and pass {
httpStatus, rowMetadata: row.response?.metadata } (or similar relevant fields)
as structured context; ensure references to this.evalRecord.addResult,
summarizeEvaluateResultForLogging, logger.error, httpStatus, and
isNonTransientHttpStatus are used to locate where to change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e9430241-632d-4748-bba6-84198d593640

📥 Commits

Reviewing files that changed from the base of the PR and between fe6d376 and e47e27b.

📒 Files selected for processing (10)
  • site/docs/configuration/caching.md
  • site/docs/guides/evaluate-llm-temperature.md
  • src/cache.ts
  • src/commands/eval.ts
  • src/evaluator.ts
  • src/index.ts
  • test/cache.test.ts
  • test/commands/eval/evaluateOptions.test.ts
  • test/evaluator.integration.transforms.test.ts
  • test/evaluator.test.ts

Comment on lines +4000 to 4141
it('isolates manual provider cache entries by repeat index', async () => {
await clearCache();

let cacheMissCount = 0;
const provider: ApiProvider = {
id: () => 'mock-provider',
callApi: vi
.fn()
.mockImplementation(async (_prompt: string, context?: Record<string, any>) => {
const cache = await context?.getCache();
const cachedResponse = await cache?.get('manual-provider-key');
if (cachedResponse) {
return {
...(cachedResponse as ProviderResponse),
cached: true,
};
}

cacheMissCount += 1;
const response = {
cached: false,
output: `result-repeat-${context?.repeatIndex}-miss-${cacheMissCount}`,
tokenUsage: createEmptyTokenUsage(),
};
await cache?.set('manual-provider-key', response);
return response;
}),
};

const testSuite: TestSuite = {
providers: [provider],
prompts: [toPrompt('Test prompt')],
tests: [{}],
};

const firstEval = await Eval.create({}, testSuite.prompts, { id: randomUUID() });
await evaluate(testSuite, firstEval, { maxConcurrency: 1, repeat: 2 });
const firstSummary = await firstEval.toEvaluateSummary();

expect(cacheMissCount).toBe(2);
expect(firstSummary.results.map((result) => result.response?.output)).toEqual([
'result-repeat-0-miss-1',
'result-repeat-1-miss-2',
]);
expect(firstSummary.results.map((result) => result.response?.cached)).toEqual([false, false]);

const secondEval = await Eval.create({}, testSuite.prompts, { id: randomUUID() });
await evaluate(testSuite, secondEval, { maxConcurrency: 1, repeat: 2 });
const secondSummary = await secondEval.toEvaluateSummary();

expect(cacheMissCount).toBe(2);
expect(secondSummary.results.map((result) => result.response?.output)).toEqual([
'result-repeat-0-miss-1',
'result-repeat-1-miss-2',
]);
expect(secondSummary.results.map((result) => result.response?.cached)).toEqual([true, true]);
});

it('isolates beforeEach extension cache entries by repeat index', async () => {
await clearCache();

let extensionCacheMissCount = 0;
vi.mocked(runExtensionHook).mockImplementation(async (_extensions, hookName, context) => {
if (hookName !== 'beforeEach' || !('test' in context)) {
return context;
}

const hookContext = context as typeof context & {
test: {
vars?: Record<string, unknown>;
};
};

const cache = getCache();
const cachedHookValue = await cache.get<string>('extension-hook-key');
if (cachedHookValue) {
return {
...hookContext,
test: {
...hookContext.test,
vars: {
...hookContext.test.vars,
hookValue: cachedHookValue,
},
},
};
}

extensionCacheMissCount += 1;
const hookValue = `hook-repeat-${extensionCacheMissCount}`;
await cache.set('extension-hook-key', hookValue);

return {
...hookContext,
test: {
...hookContext.test,
vars: {
...hookContext.test.vars,
hookValue,
},
},
};
});

const provider: ApiProvider = {
id: () => 'mock-provider',
callApi: vi
.fn()
.mockImplementation(async (_prompt: string, context?: Record<string, any>) => ({
cached: false,
output: context?.vars?.hookValue,
tokenUsage: createEmptyTokenUsage(),
})),
};

const testSuite: TestSuite = {
providers: [provider],
prompts: [toPrompt('Test prompt')],
tests: [{ vars: {} }],
extensions: ['file://hook.js'],
};

const firstEval = await Eval.create({}, testSuite.prompts, { id: randomUUID() });
await evaluate(testSuite, firstEval, { maxConcurrency: 1, repeat: 2 });
const firstSummary = await firstEval.toEvaluateSummary();

expect(extensionCacheMissCount).toBe(2);
expect(firstSummary.results.map((result) => result.response?.output)).toEqual([
'hook-repeat-1',
'hook-repeat-2',
]);

const secondEval = await Eval.create({}, testSuite.prompts, { id: randomUUID() });
await evaluate(testSuite, secondEval, { maxConcurrency: 1, repeat: 2 });
const secondSummary = await secondEval.toEvaluateSummary();

expect(extensionCacheMissCount).toBe(2);
expect(secondSummary.results.map((result) => result.response?.output)).toEqual([
'hook-repeat-1',
'hook-repeat-2',
]);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Clear cache after these tests to prevent cross-test leakage.

Both tests mutate shared cache state and only call clearCache() at setup time. Add teardown cleanup so later tests are not affected when execution order is randomized.

🧹 Suggested hardening
   it('isolates manual provider cache entries by repeat index', async () => {
     await clearCache();
+    try {
       // ... existing test body ...
-  });
+    } finally {
+      await clearCache();
+    }
+  });

   it('isolates beforeEach extension cache entries by repeat index', async () => {
     await clearCache();
+    try {
       // ... existing test body ...
-  });
+    } finally {
+      await clearCache();
+    }
+  });

Based on learnings: Tests must be independent and can run in any order (configured to run in random order by default in vitest.config.ts).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/evaluator.test.ts` around lines 4000 - 4141, The tests "isolates manual
provider cache entries by repeat index" and "isolates beforeEach extension cache
entries by repeat index" mutate shared cache and only call clearCache() at
start; add teardown cleanup by calling clearCache() after each test (e.g., in a
finally block at the end of each it() or by adding an afterEach(async () =>
await clearCache()) at the top of the spec) so that the cache is cleared
regardless of test outcome; update the tests referencing clearCache(),
evaluate(), Eval.create(), and the mocked runExtensionHook to ensure
clearCache() runs after each test to prevent cross-test leakage.

…ntext

- Clear `namespacedCacheInstances` map in `clearCache()` to prevent
  unbounded memory growth in long-running processes
- Wrap `clearNamespacedCache` delete operations with try-catch that
  surfaces the namespace and key count in the error message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@use-tusk
Copy link
Copy Markdown
Contributor

use-tusk bot commented Apr 4, 2026

No test execution environment matched

View output

Tip

New to Tusk Unit Tests? Learn more here.

View check history

Commit Status Output Created (UTC)
2284b69 No test execution environment matched Output Apr 4, 2026 8:36AM

Copy link
Copy Markdown
Member

@mldangelo mldangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elegant fix replacing the blunt 'disable cache for repeat > 1' with per-repeat-index cache namespaces via AsyncLocalStorage. The withCacheNamespace pattern is clean and composable, supporting nested scoping. Backwards compatible (repeat index 0 uses no namespace). All 184 tests pass (29 cache + 155 evaluator). Good docs additions explaining the behavior.

…-cache-namespaces-1431

# Conflicts:
#	src/evaluator.ts
#	test/evaluator.test.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

promptfoo/src/evaluator.ts

Lines 3012 to 3015 in 928e748

await runExtensionHook(context.testSuite.extensions, 'afterEach', {
test: evalStep.test,
result: row,
});

P1 Badge Scope afterEach hooks to repeat cache namespace

Wrap afterEach extension execution in the same repeat namespace used for beforeEach/runEvalInternal. processEvalStep uses withCacheNamespace(...), but processEvalRows calls runExtensionHook(..., 'afterEach', ...) outside that scope. Any getCache/fetchWithCache in an afterEach hook will share global keys across repeats, causing cross-repeat cache contamination.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Member

@mldangelo mldangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoroughly reviewed: namespace-based cache isolation for repeat indices is correct and well-tested. 226 tests pass, TSC clean, real evals verified with --repeat 2 (cold and warm runs). AsyncResource.bind correctly preserves ALS context through deferred grading queue. No bugs found.

@mldangelo mldangelo merged commit fbd59a6 into main Apr 10, 2026
40 checks passed
@mldangelo mldangelo deleted the mdangelo/codex/repeat-cache-namespaces-1431 branch April 10, 2026 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow caching when evaluateOptions.repeat is greater than 1

3 participants