Handle BrokenProcessPool with isolated single-process fallback #2404

coderabbitai · 2026-04-10T08:36:02Z

⚠️ Potential issue | 🟠 Major

--disable-process-pooling no longer controls the default executor path.

On Python 3.11+, this now recycles ProcessPoolExecutor workers unconditionally, while the multiprocessing.Pool branch still keys recycling off args.disable_process_pooling. That makes backend behavior inconsistent and removes the existing opt-in/opt-out control from the default path. If the new default is intentional, please update the CLI contract/help/docs accordingly; otherwise, thread args.disable_process_pooling through here as well.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@newton/tests/thirdparty/unittest_parallel.py` around lines 243 - 245, The ProcessPoolExecutor branch unconditionally sets worker recycling on Python 3.11+ causing inconsistent behavior with the multiprocessing.Pool branch; update the logic around executor_kwargs and ProcessPoolExecutor so it respects the same CLI flag by checking args.disable_process_pooling before adding max_tasks_per_child (i.e., only set executor_kwargs["max_tasks_per_child"]=1 when sys.version_info >= (3,11) AND args.disable_process_pooling is false), ensuring ProcessPoolExecutor, executor_kwargs and args.disable_process_pooling are used consistently.

Intentional — the two paths serve different roles:

The ProcessPoolExecutor path is the default that all users hit. This is where Test error on Horde Win11 L40: concurrent.futures.process.BrokenProcessPool #1847 keeps recurring, so it needs to be safe out of the box. Making recycling unconditional here is the fix.

The multiprocessing.Pool path is a fallback behind --disable-concurrent-futures, used for debugging. --disable-process-pooling was added as an opt-in for that fallback path specifically.

Gating the fix behind --disable-process-pooling would keep recycling opt-in on the default path, which is exactly the status quo that keeps crashing. Nobody passes that flag today, and requiring it would not fix the issue.

The performance cost of always recycling is negligible — Warp kernels are cached on disk, so worker re-initialization loads from cache rather than recompiling.

@vreutskyy, thanks for the clear explanation — that makes the design contract explicit and the reasoning is sound. The two flags operate on completely separate code paths with different roles, so there's no inconsistency. My concern is withdrawn.

✏️ Learnings added

Learnt from: vreutskyy Repo: newton-physics/newton PR: 2404 File: newton/tests/thirdparty/unittest_parallel.py:243-245 Timestamp: 2026-04-10T08:53:15.343Z Learning: In newton-physics/newton unittest_parallel.py, `max_tasks_per_child=1` on the `ProcessPoolExecutor` path is intentionally unconditional (guarded only by Python >= 3.11). The `ProcessPoolExecutor` branch is the default path that all users hit and must recycle workers out of the box to prevent BrokenProcessPool (issue `#1847`). The `--disable-process-pooling` flag is scoped exclusively to the `multiprocessing.Pool` fallback path (behind `--disable-concurrent-futures`), which is a debug-only path. Do not flag the absence of `--disable-process-pooling` gating on the `ProcessPoolExecutor` path as inconsistency.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: jumyungc Repo: newton-physics/newton PR: 1333 File: newton/_src/sim/builder.py:0-0 Timestamp: 2026-01-13T03:11:40.556Z Learning: Ensure all VBD damping semantics follow Rayleigh-style damping and are unitless. This convention, demonstrated in SolverVBD, should be applied consistently across all VBD-related constraints in the codebase. In newton/_src/sim/builder.py, validate that any damping terms use the same normalization and are treated as dimensionless; add comments and unit tests to enforce consistency across VBD components.

Learnt from: adenzler-nvidia Repo: newton-physics/newton PR: 1645 File: newton/tests/test_mujoco_solver.py:0-0 Timestamp: 2026-02-16T15:42:56.636Z Learning: In tests for the Newton solver, prefer importing and using SolverMuJoCo._mujoco_warp instead of importing mujoco_warp directly. This avoids hard dependencies on the external MJWarp package and enables skipping/xfailing tests gracefully when MJWarp isn't installed. Update test imports to import SolverMuJoCo and reference SolverMuJoCo._mujoco_warp, and guard optional behavior with pytest.importorskip or try/except ImportError as appropriate.

Learnt from: adenzler-nvidia Repo: newton-physics/newton PR: 1773 File: newton/tests/test_mujoco_solver.py:7550-7556 Timestamp: 2026-02-26T12:51:12.942Z Learning: Do not add defensive try/except guards around SolverMuJoCo construction in tests that rely on mujoco_warp features (e.g., set_length_range, set_const_0). The repo pins mujoco-warp >=3.5.0.1 (PR `#1773`), so tests may assume these APIs exist. If a test would fail due to a missing API, prefer failing fast or asserting the API/version instead of swallowing the error.

Learnt from: jvonmuralt Repo: newton-physics/newton PR: 1925 File: newton/examples/contacts/example_brick_stacking.py:385-385 Timestamp: 2026-03-05T12:15:57.546Z Learning: In Warp kernel code within the newton-physics/newton repository, composite type slicing (e.g., wp.vec4[:3], wp.mat33[:-1, 1:]) is supported since v1.9.0, and array-level negative indexing/slicing inside kernels was extended in v1.10.0. Expressions like ee_rot_target[tid][:3] (where ee_rot_target is wp.array(dtype=wp.vec4)) are valid and should not be flagged as unsupported slice syntax. Apply this understanding broadly to Python files in Warp contexts during code reviews.

Learnt from: eric-heiden Repo: newton-physics/newton PR: 1988 File: newton/tests/thirdparty/unittest_parallel.py:245-261 Timestamp: 2026-03-07T20:07:41.440Z Learning: In newton-physics/newton, this learning explains that the menagerie pre-download test list covers only non-skipped menagerie tests and test_robot_composer. Do not mark it incomplete based solely on raw robot_folder assignments. Be aware that in test_menagerie_mujoco.py placeholder MJCF classes set skip_reason and TestMenagerieBase.setUpClass() may skip before download_menagerie_asset() runs, and in test_menagerie_usd_mujoco.py placeholder USD stubs skip before super().setUpClass() when usd_asset_folder is unset, so downloads aren’t triggered. When reviewing this area, validate expectations against the actual skip timing and ensure that absence of a download early in test setup is intentional and not a sign of missing test coverage.

Learnt from: preist-nvidia Repo: newton-physics/newton PR: 1968 File: newton/_src/solvers/xpbd/solver_xpbd.py:249-250 Timestamp: 2026-03-09T14:12:43.599Z Learning: In the repository newton-physics/newton, PRs labeled '1.0-release' are considered part of the pre-stable API cleanup and are exempt from the deprecation-first policy and the [Unreleased] CHANGELOG entry requirement. Breaking changes included in the 1.0 release do not require a prior deprecation window. This applies to Python files across the codebase.

Learnt from: vreutskyy Repo: newton-physics/newton PR: 2074 File: newton/tests/test_import_usd.py:0-0 Timestamp: 2026-03-12T22:28:25.778Z Learning: In test_import_usd.py, if testing USD import cycles, ensure that the TestResolveUsdFromUrl.test_cyclic_references validates that each URL (scene.usd, a.usd, b.usd) is fetched exactly once to guard against root re-download regressions. Use mocks/spies on the fetch mechanism and assert call counts per URL, and verify no extra fetches occur during the cycle. This pattern should apply to other tests that exercise URL-based resource graphs and potential re-downloads.

Learnt from: shi-eric Repo: newton-physics/newton PR: 2242 File: newton/_src/utils/import_usd.py:184-184 Timestamp: 2026-03-27T16:48:54.491Z Learning: Because `parse_usd` in `newton/_src/utils/import_usd.py` is a private helper and is not re-exported as part of the public API (the public entry point is `ModelBuilder.add_usd`), avoid Sphinx cross-references (`:func:`) to `parse_usd` in docstrings. If you must mention it, use inline code literals (e.g., ``parse_usd()``) rather than `:func:` references, since there is no valid public Sphinx target for this private function.

Learnt from: eric-heiden Repo: newton-physics/newton PR: 2261 File: newton/_src/sim/builder.py:0-0 Timestamp: 2026-04-01T14:40:44.664Z Learning: When implementing a deprecation shim in the Python codebase, avoid using a sentinel/default strategy that can’t reliably distinguish “omitted” from “explicit default” if the only downside is a single corner-case ambiguity. Prefer keeping clean public signatures (no sentinel) unless doing so would violate a documented guarantee. If you keep the clean signature, ensure the documentation/release notes (docs/ and/or CHANGELOG) explicitly forbid mixing the old and new names/parameters in the same call to prevent precedence ambiguity (e.g., which name wins).

I have two follow up questions:

Can you quantify perf impact on CICD UT runs of this change?

Why is the change not limited to Windows?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

-Original file line number
+Diff line change
@@ Expand Up / @@ -231,12 +231,18 @@ def main(argv=None): @@
                             results = pool.map(test_manager.run_tests, test_suites)
                     else:
                         # NVIDIA Modification added concurrent.futures
-                        with concurrent.futures.ProcessPoolExecutor(
-                            max_workers=process_count,
-                            mp_context=multiprocessing.get_context(method="spawn"),
-                            initializer=initialize_test_process,
-                            initargs=(manager.Lock(), shared_index, args, temp_dir),
-                        ) as executor:
+                        executor_kwargs = {
+                            "max_workers": process_count,
+                            "mp_context": multiprocessing.get_context(method="spawn"),
+                            "initializer": initialize_test_process,
+                            "initargs": (manager.Lock(), shared_index, args, temp_dir),
+                        }
+                        # Recycle workers after each test suite to prevent host
+                        # memory accumulation from Warp kernel compilation and
+                        # other allocations that persist across suites (see #1847).
+                        if sys.version_info >= (3, 11):
+                            executor_kwargs["max_tasks_per_child"] = 1
+                        with concurrent.futures.ProcessPoolExecutor(**executor_kwargs) as executor:
                             test_manager = ParallelTestManager(manager, args, temp_dir)
                             results = list(executor.map(test_manager.run_tests, test_suites, timeout=2400))
             else:
@@ Expand Down @@

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle BrokenProcessPool with isolated single-process fallback #2404

Diff view

Diff view

There are no files selected for viewing

coderabbitai Bot Apr 10, 2026 •

edited

Loading

Uh oh!

vreutskyy Apr 10, 2026

Uh oh!

coderabbitai Bot Apr 10, 2026

Uh oh!

preist-nvidia Apr 10, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot Apr 10, 2026

Uh oh!

Uh oh!

Handle BrokenProcessPool with isolated single-process fallback #2404

Handle BrokenProcessPool with isolated single-process fallback #2404

Uh oh!

Uh oh!

Diff view

Diff view

There are no files selected for viewing

coderabbitai Bot Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vreutskyy Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

preist-nvidia Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 10, 2026 •

edited

Loading

preist-nvidia Apr 10, 2026 •

edited

Loading