Handle BrokenProcessPool with isolated single-process fallback #2404

vreutskyy wants to merge 6 commits into newton-physics:main from
Conversation
…physics#1847) The parallel test runner's ProcessPoolExecutor reuses worker processes indefinitely. On Windows (spawn context), each worker accumulates host memory from Warp kernel compilation, OpenUSD data structures, and test allocations that persist across test suites. When aggregate memory exceeds available RAM, Windows terminates a worker process, which surfaces as a BrokenProcessPool exception.

Set max_tasks_per_child=1 (available in Python 3.11+) so each worker is recycled after completing one test suite. Warp kernels are cached on disk, so re-initialization after a worker restart loads from cache rather than recompiling. The legacy multiprocessing.Pool fallback already supports this via --disable-process-pooling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
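The recycling described in this commit can be sketched as follows; `make_executor_kwargs` is a hypothetical helper name for illustration, while the real runner builds its executor kwargs inline:

```python
import sys
import concurrent.futures

def make_executor_kwargs(max_workers):
    # Hypothetical helper mirroring the change described above.
    # max_tasks_per_child is only accepted by ProcessPoolExecutor on 3.11+.
    kwargs = {"max_workers": max_workers}
    if sys.version_info >= (3, 11):
        # Recycle each worker after a single task so per-process memory
        # (Warp compilation state, OpenUSD data, test allocations) cannot
        # accumulate across test suites.
        kwargs["max_tasks_per_child"] = 1
    return kwargs
```

A pool built with these kwargs (`concurrent.futures.ProcessPoolExecutor(**make_executor_kwargs(n))`) replaces each worker process after it completes one submitted task.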
📝 Walkthrough

Replace the single bulk

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(220,240,255,0.5)
        participant Runner
        participant ParallelExecutor
        participant Worker
        participant FallbackExecutor
    end
    Runner->>ParallelExecutor: submit all suites (map)
    ParallelExecutor->>Worker: run suite tasks in parallel
    alt all results within _SUITE_TIMEOUT
        Worker-->>ParallelExecutor: suite results
        ParallelExecutor-->>Runner: yield results
    else Timeout / BrokenProcessPool / Exception
        ParallelExecutor-->>Runner: raise TimeoutError/BrokenProcessPool/Exception
        Runner->>Runner: check CI env (CI/GITHUB_ACTIONS/GITLAB_CI)
        alt CI present
            Runner-->>Runner: exit process (abort)
        else CI absent
            Runner->>FallbackExecutor: for each remaining suite submit single-worker executor
            FallbackExecutor->>Worker: run suite isolated (max_workers=1)
            alt result within _SUITE_TIMEOUT
                Worker-->>FallbackExecutor: suite result
                FallbackExecutor-->>Runner: yield result
            else timeout/crash
                FallbackExecutor-->>Runner: create_crash_result(suite)
            end
        end
    end
```
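The "check CI env" step in the diagram can be sketched as a small predicate; `running_in_ci` is a hypothetical helper name, as the runner inlines this check:

```python
import os

_CI_ENV_VARS = ("CI", "GITHUB_ACTIONS", "GITLAB_CI")

def running_in_ci(env=None):
    # Abort-vs-fallback decision from the diagram: any of these variables
    # holding a non-empty value means we are on a CI runner, so the run
    # should abort rather than retry suites one by one.
    env = os.environ if env is None else env
    return any(env.get(var) for var in _CI_ENV_VARS)
```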
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/tests/thirdparty/unittest_parallel.py`:
- Around line 243-245: The ProcessPoolExecutor branch unconditionally sets
worker recycling on Python 3.11+ causing inconsistent behavior with the
multiprocessing.Pool branch; update the logic around executor_kwargs and
ProcessPoolExecutor so it respects the same CLI flag by checking
args.disable_process_pooling before adding max_tasks_per_child (i.e., only set
executor_kwargs["max_tasks_per_child"]=1 when sys.version_info >= (3,11) AND
args.disable_process_pooling is false), ensuring ProcessPoolExecutor,
executor_kwargs and args.disable_process_pooling are used consistently.
📒 Files selected for processing (1)
newton/tests/thirdparty/unittest_parallel.py
```python
if sys.version_info >= (3, 11):
    executor_kwargs["max_tasks_per_child"] = 1
with concurrent.futures.ProcessPoolExecutor(**executor_kwargs) as executor:
```
--disable-process-pooling no longer controls the default executor path.
On Python 3.11+, this now recycles ProcessPoolExecutor workers unconditionally, while the multiprocessing.Pool branch still keys recycling off args.disable_process_pooling. That makes backend behavior inconsistent and removes the existing opt-in/opt-out control from the default path. If the new default is intentional, please update the CLI contract/help/docs accordingly; otherwise, thread args.disable_process_pooling through here as well.
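The gating this comment suggests would look roughly like the sketch below. This is an alternative the reviewer proposes, not what the PR merged, and `build_executor_kwargs` plus the `args` namespace are stand-ins for the runner's inline kwargs construction and parsed CLI options:

```python
import sys
from argparse import Namespace

def build_executor_kwargs(args, version_info=None):
    # Reviewer-suggested variant: only recycle workers when Python supports
    # max_tasks_per_child AND the user has not opted out via
    # --disable-process-pooling, matching the multiprocessing.Pool branch.
    version_info = sys.version_info if version_info is None else version_info
    kwargs = {}
    if version_info >= (3, 11) and not args.disable_process_pooling:
        kwargs["max_tasks_per_child"] = 1
    return kwargs
```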
Intentional; the two paths serve different roles:

- The `ProcessPoolExecutor` path is the default that all users hit. This is where "Test error on Horde Win11 L40: concurrent.futures.process.BrokenProcessPool" (#1847) keeps recurring, so it needs to be safe out of the box. Making recycling unconditional here is the fix.
- The `multiprocessing.Pool` path is a fallback behind `--disable-concurrent-futures`, used for debugging. `--disable-process-pooling` was added as an opt-in for that fallback path specifically.
Gating the fix behind --disable-process-pooling would keep recycling opt-in on the default path, which is exactly the status quo that keeps crashing. Nobody passes that flag today, and requiring it would not fix the issue.
The performance cost of always recycling is negligible — Warp kernels are cached on disk, so worker re-initialization loads from cache rather than recompiling.
@vreutskyy, thanks for the clear explanation — that makes the design contract explicit and the reasoning is sound. The two flags operate on completely separate code paths with different roles, so there's no inconsistency. My concern is withdrawn.
✏️ Learnings added
Learnt from: vreutskyy
Repo: newton-physics/newton PR: 2404
File: newton/tests/thirdparty/unittest_parallel.py:243-245
Timestamp: 2026-04-10T08:53:15.343Z
Learning: In newton-physics/newton unittest_parallel.py, `max_tasks_per_child=1` on the `ProcessPoolExecutor` path is intentionally unconditional (guarded only by Python >= 3.11). The `ProcessPoolExecutor` branch is the default path that all users hit and must recycle workers out of the box to prevent BrokenProcessPool (issue `#1847`). The `--disable-process-pooling` flag is scoped exclusively to the `multiprocessing.Pool` fallback path (behind `--disable-concurrent-futures`), which is a debug-only path. Do not flag the absence of `--disable-process-pooling` gating on the `ProcessPoolExecutor` path as inconsistency.
I have two follow-up questions:
- Can you quantify the perf impact of this change on CI/CD unit-test runs?
- Why is the change not limited to Windows?
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@Kenny-Vilella can you please have a look as well? We are delaying the release to Monday to resolve this.
FYI, this file is from Warp. I think Warp has improved this file compared to the Newton version. We should ask someone from the Warp team who is familiar with this file to advise.
…n-physics#1847) Port crash-resilience improvements from Warp's unittest_parallel.py. Instead of aborting the entire test run when a worker process dies unexpectedly, catch BrokenProcessPool (and TimeoutError) and retry all test suites in isolated single-process mode: one fresh ProcessPoolExecutor(max_workers=1) per suite.

This replaces the earlier max_tasks_per_child approach: since the root cause of the worker crashes is not reliably reproducible, surviving the crash gracefully is more robust than trying to prevent it. The fallback has zero performance impact on the happy path.

Key changes:
- Catch BrokenProcessPool, TimeoutError, and generic exceptions around the parallel executor.map() call
- On failure in CI (GITHUB_ACTIONS, GITLAB_CI, CI), exit immediately to respect job timeout constraints
- On failure outside CI, retry each suite in an isolated single-worker pool with per-suite error reporting via create_crash_result()
- Add _get_suite_name() and create_crash_result() helpers
- Increase suite timeout from 2400s to 3600s (aligned with Warp)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
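The control flow this commit describes, condensed into a runnable sketch. The real runner's ParallelTestManager, worker initializer, and create_crash_result() are replaced with stand-ins (`_run_suite`, tuple markers), so this shows the shape of the fallback rather than the actual implementation:

```python
import concurrent.futures
import os
import sys
from concurrent.futures.process import BrokenProcessPool

def _run_suite(suite):
    # Stand-in for running one test suite in a worker process.
    return suite * suite

def run_suites(suites, workers, timeout=None, env=None):
    env = os.environ if env is None else env
    try:
        with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as ex:
            # Happy path: one pool, all suites mapped in parallel.
            return list(ex.map(_run_suite, suites, timeout=timeout))
    except (BrokenProcessPool, TimeoutError):
        if any(env.get(v) for v in ("GITHUB_ACTIONS", "GITLAB_CI", "CI")):
            sys.exit(1)  # in CI, fail fast to respect the job's own timeout
        # Outside CI: one fresh single-worker pool per suite, so a crashing
        # suite only loses its own result instead of aborting the whole run.
        results = []
        for suite in suites:
            try:
                with concurrent.futures.ProcessPoolExecutor(max_workers=1) as ex:
                    results.append(ex.submit(_run_suite, suite).result(timeout=timeout))
            except (BrokenProcessPool, TimeoutError):
                results.append(("crashed", suite))  # stand-in for create_crash_result()
        return results
```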
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/tests/thirdparty/unittest_parallel.py`:
- Around line 241-265: The warning message incorrectly implies a per-suite
blocker even though executor.map(timeout=_SUITE_TIMEOUT) enforces a wall-clock
limit on the whole parallel phase; update the code to either implement a true
per-suite timeout (submit each suite via executor.submit and enforce per-future
timeouts with as_completed/future.result(timeout=_SUITE_TIMEOUT)) or, more
simply, change the warning text printed in the except TimeoutError block (the
print that references pending_index and suite_name) to remove the misleading
“different suite may be the actual blocker” wording and state that the parallel
phase hit the global wall-clock timeout; refer to executor.map, _SUITE_TIMEOUT,
ParallelTestManager.run_tests, pending_index and suite_name to locate the
relevant code to change.
📒 Files selected for processing (1)
newton/tests/thirdparty/unittest_parallel.py
```python
try:
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=process_count,
        mp_context=multiprocessing.get_context(method="spawn"),
        initializer=initialize_test_process,
        initargs=(manager.Lock(), shared_index, args, temp_dir),
    ) as executor:
        test_manager = ParallelTestManager(manager, args, temp_dir)
        for result in executor.map(test_manager.run_tests, test_suites, timeout=_SUITE_TIMEOUT):
            results.append(result)
except TimeoutError:
    pending_index = len(results)
    total = len(test_suites)
    suite_name = _get_suite_name(test_suites[pending_index]) if pending_index < total else "unknown"
    print(
        f"Warning: Parallel execution timed out (total timeout={_SUITE_TIMEOUT}s). "
        f"Next pending result was suite "
        f"{pending_index + 1}/{total} ({suite_name}), "
        f"but a different suite may be the actual blocker. "
        f"Switching to isolated single-process fallback.",
        file=sys.stderr,
    )
    parallel_failed = True
    parallel_fail_reason = "timed out"
```
🧩 Analysis chain
🌐 Web query: Python concurrent.futures Executor.map timeout semantics, per item vs total duration
💡 Result:

Executor.map(..., timeout=T) is a single overall deadline, not "T seconds per item".

- Python sets a deadline at the moment you call Executor.map(). The iterator it returns will raise TimeoutError when you call __next__() and the next result isn't ready by T seconds after that original map() call. So each next() effectively only gets the remaining time until the same global deadline. [1]
- If you never iterate the returned iterator (e.g., you don't loop it, list() it, etc.), then __next__() is never called and no timeout is triggered. [2]
- This timeout does not cancel/stop running work; it only stops waiting for (ordered) results. Also, because map() yields results in input order, one slow early item can block yielding later items even if they're already finished. [1]

If you want per-item timeouts, use submit() and then future.result(timeout=...) per future (or manage futures with as_completed()), rather than relying on Executor.map(timeout=...). [1]

Sources: Python docs for Executor.map() semantics [1]; discussion/example showing timeout only occurs during iteration [2].
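The single-shared-deadline behavior is easy to verify. The sketch below uses a ThreadPoolExecutor only to keep the demo light; the map(timeout=...) semantics are the same for the process pool:

```python
import concurrent.futures
import time

def _sleep_and_return(seconds):
    time.sleep(seconds)
    return seconds

# One worker, two 0.3 s tasks, one 0.5 s deadline shared by the whole map():
# the first result arrives within the deadline, but the second next() call
# hits the same global deadline even though its task is still progressing.
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
    it = ex.map(_sleep_and_return, [0.3, 0.3], timeout=0.5)
    first = next(it)  # ready at ~0.3 s, before the 0.5 s deadline
    try:
        next(it)      # second result would be ready at ~0.6 s: too late
        second_timed_out = False
    except concurrent.futures.TimeoutError:
        second_timed_out = True
```

Note that the timeout only stops waiting; the second task still runs to completion when the executor shuts down.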
🏁 Script executed:
# First, let's examine the actual code in the file to understand context
cat -n newton/tests/thirdparty/unittest_parallel.py | sed -n '240,270p'Repository: newton-physics/newton
Length of output: 2127
🏁 Script executed:
# Let's also check if there are any comments in the code about timeout intent
rg -A 5 -B 5 "executor.map.*timeout" newton/tests/thirdparty/unittest_parallel.pyRepository: newton-physics/newton
Length of output: 741
🏁 Script executed:
# Check the definition of _SUITE_TIMEOUT to understand the intent
rg "_SUITE_TIMEOUT" newton/tests/thirdparty/unittest_parallel.pyRepository: newton-physics/newton
Length of output: 684
The timeout comment and warning message are misleading about the timeout scope.
Line 249's executor.map(timeout=_SUITE_TIMEOUT) applies a wall-clock deadline to the entire parallel phase (not per-suite), which matches the _SUITE_TIMEOUT comment "Wall-clock limit for parallel execution". However, the warning message at line 257 states "Next pending result was suite ... but a different suite may be the actual blocker," which is inconsistent with wall-clock timeout semantics—under a global deadline, which suite is slow doesn't matter; the entire parallel phase has exhausted its time budget. If the intent is truly a per-suite wall-clock cap for parallel execution (to match the isolated fallback's per-suite behavior), clarify the code comment and revise the warning message to reflect that. Otherwise, update the warning to remove the misleading reference to specific suites being blockers when the timeout is global.
I wouldn't recommend adopting Warp's approach. The single-process fallback was meant for QA use so that they could get some results in light of known failures at the time. I think the right thing to do is hold off on merging anything until we actually characterize what's going on to cause a BrokenProcessPool, e.g. set PYTHONFAULTHANDLER=1 in the environment. This should give more details about the crash, which is usually a segfault.
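For context on the suggestion above: PYTHONFAULTHANDLER=1 enables the stdlib faulthandler module at interpreter startup, so on a segfault each process dumps the Python stacks of all its threads to stderr. A minimal in-process sketch of the same machinery, triggering a dump manually instead of via a crash:

```python
import faulthandler
import tempfile

def capture_thread_dump():
    # faulthandler.enable() is what PYTHONFAULTHANDLER=1 does at startup;
    # on SIGSEGV/SIGFPE/SIGABRT/SIGBUS it writes each thread's Python stack
    # to the given file (which must have a real file descriptor).
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.enable(file=f)
        try:
            # Manual dump for illustration; a real crash dumps automatically.
            faulthandler.dump_traceback(file=f)
        finally:
            faulthandler.disable()
        f.seek(0)
        return f.read()
```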
Description
Closes #1847
Ports crash-resilience improvements from Warp's unittest_parallel.py to handle the recurring BrokenProcessPool crash in the parallel test runner.

Problem
The parallel test runner uses ProcessPoolExecutor to run test suites across multiple workers. Intermittently, on Windows Horde machines (16 GB RAM), CI (64 GB RAM), and potentially other environments, a worker process dies abruptly, causing BrokenProcessPool, which aborts the entire test run with no useful diagnostics.

The root cause has not been reliably reproduced despite multiple attempts (local 32 GB machines with up to 32 workers, Horde Ubuntu 32 GB with 32 workers pushing swap to 0 bytes free). Previous fixes addressed specific crash vectors (#1988 OpenUSD threading, #2262 Open3D Poisson threading) but the issue keeps recurring.
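The failure mode itself is easy to reproduce synthetically. If a worker process dies for any reason (here via os._exit; in production, the OS killing it under memory pressure), the pool surfaces BrokenProcessPool to the caller:

```python
import concurrent.futures
import os
from concurrent.futures.process import BrokenProcessPool

def _die(_):
    os._exit(1)  # simulate a worker process being abruptly terminated

def pool_breaks():
    # Consuming results from a pool whose worker died raises
    # BrokenProcessPool, which is what aborts the test run.
    try:
        with concurrent.futures.ProcessPoolExecutor(max_workers=1) as ex:
            list(ex.map(_die, [0]))
    except BrokenProcessPool:
        return True
    return False
```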
Approach
Instead of trying to prevent the crash (which we can't reliably reproduce), survive it gracefully, ported from Warp's battle-tested implementation:

- Catch BrokenProcessPool (and TimeoutError) around the executor.map() call instead of letting it abort the run
- In CI (GITHUB_ACTIONS, GITLAB_CI, CI): exit immediately to respect job timeout constraints
- Outside CI: retry each suite in an isolated ProcessPoolExecutor(max_workers=1), with per-suite error reporting via create_crash_result()

Additional changes
- Add _get_suite_name() and create_crash_result() helpers
- Increase suite timeout from 2400s to 3600s (aligned with Warp)

Checklist

- CHANGELOG.md has been updated (if user-facing change)

Test plan
The fallback path activates only when BrokenProcessPool (or TimeoutError) occurs during parallel execution. Normal test runs are unaffected; verified locally on Windows 11 (32 GB, 8 and 16 workers) and Horde Ubuntu (32 GB, 32 workers).

Bug fix
Steps to reproduce:
The crash is intermittent and environment-dependent. Originally reported on Win11 Horde machines with 16 GB RAM and L40 GPU. Has also appeared on Linux CI with 64 GB RAM. Could not be reproduced locally despite aggressive attempts (32 workers, swap exhaustion).
🤖 Generated with Claude Code
Summary by CodeRabbit