[Stack 12/27] Speed up regression tests#2436
Closed
jucor wants to merge 2 commits intojc/clj-parity-d4-fixfrom
Closed
[Stack 12/27] Speed up regression tests#2436jucor wants to merge 2 commits intojc/clj-parity-d4-fixfrom
jucor wants to merge 2 commits intojc/clj-parity-d4-fixfrom
Conversation
b6518b8 to
241ab18
Compare
5 tasks
Contributor
There was a problem hiding this comment.
Pull request overview
This PR speeds up Delphi’s Python regression tests by avoiding unnecessary benchmark reruns and allowing the regression pipeline to skip redundant intermediate computation stages when only an overall match is needed.
Changes:
- Set
ConversationComparer.compare_with_golden()to defaultbenchmark=False(benchmarking remains opt-in). - Add
skip_intermediate_stagesplumbing throughcompare_with_golden()→compute_all_stages()/compute_all_stages_with_benchmark(). - Update the main regression pytest to skip stages 1–4 while keeping the per-stage test unchanged.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
delphi/tests/test_regression.py |
Uses skip_intermediate_stages=True for the overall regression test to reduce runtime. |
delphi/polismath/regression/utils.py |
Adds skip_intermediate_stages to compute_all_stages() and threads it through benchmark runs. |
delphi/polismath/regression/comparer.py |
Changes default benchmark behavior and forwards skip_intermediate_stages into stage computation. |
delphi/docs/PLAN_DISCREPANCY_FIXES.md |
Updates documentation to reflect the current stack/PR mapping and naming. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cb557b3 to
f0516e8
Compare
013624a to
19d4a5b
Compare
5 tasks
f0516e8 to
ebd71ca
Compare
19d4a5b to
4a57515
Compare
ebd71ca to
758355c
Compare
4a57515 to
967ffec
Compare
758355c to
35d24b1
Compare
967ffec to
4b803ca
Compare
35d24b1 to
d295389
Compare
4b803ca to
f4252a0
Compare
d295389 to
7e6ccc1
Compare
f4252a0 to
3f09ab0
Compare
303fd4e to
081bdb0
Compare
d519507 to
bd78b5e
Compare
081bdb0 to
6093159
Compare
bd78b5e to
478fa73
Compare
6093159 to
cca501b
Compare
ce0c874 to
da4b4de
Compare
49be8f6 to
c620c1e
Compare
da4b4de to
a9b000e
Compare
c620c1e to
707c63d
Compare
a9b000e to
41dec2d
Compare
707c63d to
ddb4e01
Compare
41dec2d to
b3d7506
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ddb4e01 to
8fa1cf3
Compare
b3d7506 to
5611792
Compare
…nd synthetic test tid range - Fix error message "No full blob" → "No incremental blob" in datasets.py - Remove hardcoded blob_type='incremental' in test helpers (use default) - Update docstrings from '*-full' to '*-incremental'/'*-cold_start' - Fix synthetic D2c test: range(5,11) → range(4,10) for valid tids - Add comment explaining xdist_group strategy for blob variants Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two changes that reduce regression test wall time significantly: 1. Default benchmark=False in compare_with_golden(). Benchmark mode runs the full pipeline 3x for timing statistics — useful for perf analysis but unnecessary for correctness checks. The regression_comparer script already had benchmark as opt-in (--benchmark flag), so this aligns the default. Callers that need benchmarking can still pass benchmark=True explicitly. 2. Add skip_intermediate_stages parameter to compute_all_stages(). test_conversation_regression now skips stages 1-4 (empty, load-only, PCA-only, PCA+clustering) since it only checks overall_match. test_conversation_stages_individually still runs all stages. Skipped stages are silently ignored in the comparison (existing behavior for missing stages). For large private datasets, this cuts test_conversation_regression time from ~5x to ~1x of a single pipeline run (e.g. one large conversation went from ~317s to ~60s). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5611792 to
6a96d48
Compare
8fa1cf3 to
5893382
Compare
Delphi Coverage Report
|
This was referenced Mar 30, 2026
Collaborator
Author
|
Superseded by spr-managed PR stack. See the new stack starting at #2508. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
benchmark=Falseincompare_with_golden()— benchmark mode ran the pipeline 3x for timing statistics, unnecessary for correctness checks. Theregression_comparer.pyscript already had--benchmarkas opt-in, so this aligns the default.skip_intermediate_stagesparameter tocompute_all_stages()—test_conversation_regressionnow skips stages 1-4 (empty, load-only, PCA-only, PCA+clustering) since it only checksoverall_match.test_conversation_stages_individuallystill runs all stages for granular failure detection.Measured speedup on one of the large private test conversations
test_conversation_regressiontest_conversation_stages_individuallyThe regression test's ~14x speedup comes from two combined effects: no longer running the pipeline 3x (benchmark), and skipping 4 redundant intermediate stages.
Test plan
--include-local)🤖 Generated with Claude Code