Handle BrokenProcessPool with isolated single-process fallback #2404

vreutskyy wants to merge 6 commits into newton-physics:main from
Conversation
…physics#1847) The parallel test runner's ProcessPoolExecutor reuses worker processes indefinitely. On Windows (spawn context), each worker accumulates host memory from Warp kernel compilation, OpenUSD data structures, and test allocations that persist across test suites. When aggregate memory exceeds available RAM, Windows terminates a worker process, which surfaces as a BrokenProcessPool exception.

Set max_tasks_per_child=1 (available in Python 3.11+) so each worker is recycled after completing one test suite. Warp kernels are cached on disk, so re-initialization after a worker restart loads from cache rather than recompiling. The legacy multiprocessing.Pool fallback already supports this via --disable-process-pooling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
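The recycling described in this commit can be sketched as follows; `make_executor_kwargs` is a hypothetical helper name for illustration, while the real runner builds its executor kwargs inline:

```python
import sys
import concurrent.futures

def make_executor_kwargs(max_workers):
    # Hypothetical helper mirroring the change described above.
    # max_tasks_per_child is only accepted by ProcessPoolExecutor on 3.11+.
    kwargs = {"max_workers": max_workers}
    if sys.version_info >= (3, 11):
        # Recycle each worker after a single task so per-process memory
        # (Warp compilation state, OpenUSD data, test allocations) cannot
        # accumulate across test suites.
        kwargs["max_tasks_per_child"] = 1
    return kwargs
```

A pool built with these kwargs (`concurrent.futures.ProcessPoolExecutor(**make_executor_kwargs(n))`) replaces each worker process after it completes one submitted task.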
📝 Walkthrough

Replace the single bulk

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(220,240,255,0.5)
        participant Runner
        participant ParallelExecutor
        participant Worker
        participant FallbackExecutor
    end
    Runner->>ParallelExecutor: submit all suites (map)
    ParallelExecutor->>Worker: run suite tasks in parallel
    alt all results within _SUITE_TIMEOUT
        Worker-->>ParallelExecutor: suite results
        ParallelExecutor-->>Runner: yield results
    else Timeout / BrokenProcessPool / Exception
        ParallelExecutor-->>Runner: raise TimeoutError/BrokenProcessPool/Exception
        Runner->>Runner: check CI env (CI/GITHUB_ACTIONS/GITLAB_CI)
        alt CI present
            Runner-->>Runner: exit process (abort)
        else CI absent
            Runner->>FallbackExecutor: for each remaining suite submit single-worker executor
            FallbackExecutor->>Worker: run suite isolated (max_workers=1)
            alt result within _SUITE_TIMEOUT
                Worker-->>FallbackExecutor: suite result
                FallbackExecutor-->>Runner: yield result
            else timeout/crash
                FallbackExecutor-->>Runner: create_crash_result(suite)
            end
        end
    end
```
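The "check CI env" step in the diagram can be sketched as a small predicate; `running_in_ci` is a hypothetical helper name, as the runner inlines this check:

```python
import os

_CI_ENV_VARS = ("CI", "GITHUB_ACTIONS", "GITLAB_CI")

def running_in_ci(env=None):
    # Abort-vs-fallback decision from the diagram: any of these variables
    # holding a non-empty value means we are on a CI runner, so the run
    # should abort rather than retry suites one by one.
    env = os.environ if env is None else env
    return any(env.get(var) for var in _CI_ENV_VARS)
```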
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/tests/thirdparty/unittest_parallel.py`:
- Around line 243-245: The ProcessPoolExecutor branch unconditionally sets
worker recycling on Python 3.11+ causing inconsistent behavior with the
multiprocessing.Pool branch; update the logic around executor_kwargs and
ProcessPoolExecutor so it respects the same CLI flag by checking
args.disable_process_pooling before adding max_tasks_per_child (i.e., only set
executor_kwargs["max_tasks_per_child"]=1 when sys.version_info >= (3,11) AND
args.disable_process_pooling is false), ensuring ProcessPoolExecutor,
executor_kwargs and args.disable_process_pooling are used consistently.
📒 Files selected for processing (1)
newton/tests/thirdparty/unittest_parallel.py
```python
if sys.version_info >= (3, 11):
    executor_kwargs["max_tasks_per_child"] = 1
with concurrent.futures.ProcessPoolExecutor(**executor_kwargs) as executor:
```
--disable-process-pooling no longer controls the default executor path.
On Python 3.11+, this now recycles ProcessPoolExecutor workers unconditionally, while the multiprocessing.Pool branch still keys recycling off args.disable_process_pooling. That makes backend behavior inconsistent and removes the existing opt-in/opt-out control from the default path. If the new default is intentional, please update the CLI contract/help/docs accordingly; otherwise, thread args.disable_process_pooling through here as well.
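The gating this comment suggests would look roughly like the sketch below. This is an alternative the reviewer proposes, not what the PR merged, and `build_executor_kwargs` plus the `args` namespace are stand-ins for the runner's inline kwargs construction and parsed CLI options:

```python
import sys
from argparse import Namespace

def build_executor_kwargs(args, version_info=None):
    # Reviewer-suggested variant: only recycle workers when Python supports
    # max_tasks_per_child AND the user has not opted out via
    # --disable-process-pooling, matching the multiprocessing.Pool branch.
    version_info = sys.version_info if version_info is None else version_info
    kwargs = {}
    if version_info >= (3, 11) and not args.disable_process_pooling:
        kwargs["max_tasks_per_child"] = 1
    return kwargs
```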
Intentional; the two paths serve different roles:

- The `ProcessPoolExecutor` path is the default that all users hit. This is where "Test error on Horde Win11 L40: concurrent.futures.process.BrokenProcessPool" (#1847) keeps recurring, so it needs to be safe out of the box. Making recycling unconditional here is the fix.
- The `multiprocessing.Pool` path is a fallback behind `--disable-concurrent-futures`, used for debugging. `--disable-process-pooling` was added as an opt-in for that fallback path specifically.
Gating the fix behind --disable-process-pooling would keep recycling opt-in on the default path, which is exactly the status quo that keeps crashing. Nobody passes that flag today, and requiring it would not fix the issue.
The performance cost of always recycling is negligible — Warp kernels are cached on disk, so worker re-initialization loads from cache rather than recompiling.
@vreutskyy, thanks for the clear explanation — that makes the design contract explicit and the reasoning is sound. The two flags operate on completely separate code paths with different roles, so there's no inconsistency. My concern is withdrawn.
✏️ Learnings added
Learnt from: vreutskyy
Repo: newton-physics/newton PR: 2404
File: newton/tests/thirdparty/unittest_parallel.py:243-245
Timestamp: 2026-04-10T08:53:15.343Z
Learning: In newton-physics/newton unittest_parallel.py, `max_tasks_per_child=1` on the `ProcessPoolExecutor` path is intentionally unconditional (guarded only by Python >= 3.11). The `ProcessPoolExecutor` branch is the default path that all users hit and must recycle workers out of the box to prevent BrokenProcessPool (issue `#1847`). The `--disable-process-pooling` flag is scoped exclusively to the `multiprocessing.Pool` fallback path (behind `--disable-concurrent-futures`), which is a debug-only path. Do not flag the absence of `--disable-process-pooling` gating on the `ProcessPoolExecutor` path as inconsistency.
I have two follow-up questions:
- Can you quantify the perf impact of this change on CI/CD unit-test runs?
- Why is the change not limited to Windows?
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@Kenny-Vilella can you please have a look as well? We are delaying the release to Monday to resolve this.
FYI, this file is from Warp. I think Warp has improved this file compared to the Newton version. We should ask someone from the Warp team who is familiar with this file to advise.
…n-physics#1847) Port crash-resilience improvements from Warp's unittest_parallel.py. Instead of aborting the entire test run when a worker process dies unexpectedly, catch BrokenProcessPool (and TimeoutError) and retry all test suites in isolated single-process mode: one fresh ProcessPoolExecutor(max_workers=1) per suite.

This replaces the earlier max_tasks_per_child approach: since the root cause of the worker crashes is not reliably reproducible, surviving the crash gracefully is more robust than trying to prevent it. The fallback has zero performance impact on the happy path.

Key changes:
- Catch BrokenProcessPool, TimeoutError, and generic exceptions around the parallel executor.map() call
- On failure in CI (GITHUB_ACTIONS, GITLAB_CI, CI), exit immediately to respect job timeout constraints
- On failure outside CI, retry each suite in an isolated single-worker pool with per-suite error reporting via create_crash_result()
- Add _get_suite_name() and create_crash_result() helpers
- Increase suite timeout from 2400s to 3600s (aligned with Warp)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
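The control flow this commit describes, condensed into a runnable sketch. The real runner's ParallelTestManager, worker initializer, and create_crash_result() are replaced with stand-ins (`_run_suite`, tuple markers), so this shows the shape of the fallback rather than the actual implementation:

```python
import concurrent.futures
import os
import sys
from concurrent.futures.process import BrokenProcessPool

def _run_suite(suite):
    # Stand-in for running one test suite in a worker process.
    return suite * suite

def run_suites(suites, workers, timeout=None, env=None):
    env = os.environ if env is None else env
    try:
        with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as ex:
            # Happy path: one pool, all suites mapped in parallel.
            return list(ex.map(_run_suite, suites, timeout=timeout))
    except (BrokenProcessPool, TimeoutError):
        if any(env.get(v) for v in ("GITHUB_ACTIONS", "GITLAB_CI", "CI")):
            sys.exit(1)  # in CI, fail fast to respect the job's own timeout
        # Outside CI: one fresh single-worker pool per suite, so a crashing
        # suite only loses its own result instead of aborting the whole run.
        results = []
        for suite in suites:
            try:
                with concurrent.futures.ProcessPoolExecutor(max_workers=1) as ex:
                    results.append(ex.submit(_run_suite, suite).result(timeout=timeout))
            except (BrokenProcessPool, TimeoutError):
                results.append(("crashed", suite))  # stand-in for create_crash_result()
        return results
```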
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/tests/thirdparty/unittest_parallel.py`:
- Around line 241-265: The warning message incorrectly implies a per-suite
blocker even though executor.map(timeout=_SUITE_TIMEOUT) enforces a wall-clock
limit on the whole parallel phase; update the code to either implement a true
per-suite timeout (submit each suite via executor.submit and enforce per-future
timeouts with as_completed/future.result(timeout=_SUITE_TIMEOUT)) or, more
simply, change the warning text printed in the except TimeoutError block (the
print that references pending_index and suite_name) to remove the misleading
“different suite may be the actual blocker” wording and state that the parallel
phase hit the global wall-clock timeout; refer to executor.map, _SUITE_TIMEOUT,
ParallelTestManager.run_tests, pending_index and suite_name to locate the
relevant code to change.
📒 Files selected for processing (1)
newton/tests/thirdparty/unittest_parallel.py
```python
try:
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=process_count,
        mp_context=multiprocessing.get_context(method="spawn"),
        initializer=initialize_test_process,
        initargs=(manager.Lock(), shared_index, args, temp_dir),
    ) as executor:
        test_manager = ParallelTestManager(manager, args, temp_dir)
        for result in executor.map(test_manager.run_tests, test_suites, timeout=_SUITE_TIMEOUT):
            results.append(result)
except TimeoutError:
    pending_index = len(results)
    total = len(test_suites)
    suite_name = _get_suite_name(test_suites[pending_index]) if pending_index < total else "unknown"
    print(
        f"Warning: Parallel execution timed out (total timeout={_SUITE_TIMEOUT}s). "
        f"Next pending result was suite "
        f"{pending_index + 1}/{total} ({suite_name}), "
        f"but a different suite may be the actual blocker. "
        f"Switching to isolated single-process fallback.",
        file=sys.stderr,
    )
    parallel_failed = True
    parallel_fail_reason = "timed out"
```
🧩 Analysis chain
🌐 Web query: Python concurrent.futures Executor.map timeout semantics, per item vs total duration
💡 Result:

Executor.map(..., timeout=T) is a single overall deadline, not "T seconds per item".

- Python sets a deadline at the moment you call Executor.map(). The iterator it returns will raise TimeoutError when you call __next__() and the next result isn't ready by T seconds after that original map() call. So each next() effectively only gets the remaining time until the same global deadline. [1]
- If you never iterate the returned iterator (e.g., you don't loop it, list() it, etc.), then __next__() is never called and no timeout is triggered. [2]
- This timeout does not cancel/stop running work; it only stops waiting for (ordered) results. Also, because map() yields results in input order, one slow early item can block yielding later items even if they're already finished. [1]

If you want per-item timeouts, use submit() and then future.result(timeout=...) per future (or manage futures with as_completed()), rather than relying on Executor.map(timeout=...). [1]

Sources: Python docs for Executor.map() semantics [1]; discussion/example showing timeout only occurs during iteration [2].
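The single-shared-deadline behavior is easy to verify. The sketch below uses a ThreadPoolExecutor only to keep the demo light; the map(timeout=...) semantics are the same for the process pool:

```python
import concurrent.futures
import time

def _sleep_and_return(seconds):
    time.sleep(seconds)
    return seconds

# One worker, two 0.3 s tasks, one 0.5 s deadline shared by the whole map():
# the first result arrives within the deadline, but the second next() call
# hits the same global deadline even though its task is still progressing.
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
    it = ex.map(_sleep_and_return, [0.3, 0.3], timeout=0.5)
    first = next(it)  # ready at ~0.3 s, before the 0.5 s deadline
    try:
        next(it)      # second result would be ready at ~0.6 s: too late
        second_timed_out = False
    except concurrent.futures.TimeoutError:
        second_timed_out = True
```

Note that the timeout only stops waiting; the second task still runs to completion when the executor shuts down.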
🏁 Script executed:
# First, let's examine the actual code in the file to understand context
cat -n newton/tests/thirdparty/unittest_parallel.py | sed -n '240,270p'Repository: newton-physics/newton
Length of output: 2127
🏁 Script executed:
# Let's also check if there are any comments in the code about timeout intent
rg -A 5 -B 5 "executor.map.*timeout" newton/tests/thirdparty/unittest_parallel.pyRepository: newton-physics/newton
Length of output: 741
🏁 Script executed:
# Check the definition of _SUITE_TIMEOUT to understand the intent
rg "_SUITE_TIMEOUT" newton/tests/thirdparty/unittest_parallel.pyRepository: newton-physics/newton
Length of output: 684
The timeout comment and warning message are misleading about the timeout scope.
Line 249's executor.map(timeout=_SUITE_TIMEOUT) applies a wall-clock deadline to the entire parallel phase (not per-suite), which matches the _SUITE_TIMEOUT comment "Wall-clock limit for parallel execution". However, the warning message at line 257 states "Next pending result was suite ... but a different suite may be the actual blocker," which is inconsistent with wall-clock timeout semantics—under a global deadline, which suite is slow doesn't matter; the entire parallel phase has exhausted its time budget. If the intent is truly a per-suite wall-clock cap for parallel execution (to match the isolated fallback's per-suite behavior), clarify the code comment and revise the warning message to reflect that. Otherwise, update the warning to remove the misleading reference to specific suites being blockers when the timeout is global.
I wouldn't recommend adopting Warp's approach. The single-process fallback was meant for QA use so that they could get some results in light of known failures at the time. I think the right thing to do is hold off on merging anything until we actually characterize what's going on to cause a BrokenProcessPool, e.g. set PYTHONFAULTHANDLER=1 in the environment. This should give more details about the crash, which is usually a segfault.
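For context on the suggestion above: PYTHONFAULTHANDLER=1 enables the stdlib faulthandler module at interpreter startup, so on a segfault each process dumps the Python stacks of all its threads to stderr. A minimal in-process sketch of the same machinery, triggering a dump manually instead of via a crash:

```python
import faulthandler
import tempfile

def capture_thread_dump():
    # faulthandler.enable() is what PYTHONFAULTHANDLER=1 does at startup;
    # on SIGSEGV/SIGFPE/SIGABRT/SIGBUS it writes each thread's Python stack
    # to the given file (which must have a real file descriptor).
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.enable(file=f)
        try:
            # Manual dump for illustration; a real crash dumps automatically.
            faulthandler.dump_traceback(file=f)
        finally:
            faulthandler.disable()
        f.seek(0)
        return f.read()
```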
Description
Closes #1847
Ports crash-resilience improvements from Warp's unittest_parallel.py to handle the recurring BrokenProcessPool crash in the parallel test runner.

Problem
The parallel test runner uses ProcessPoolExecutor to run test suites across multiple workers. Intermittently, on Windows Horde machines (16 GB RAM), CI (64 GB RAM), and potentially other environments, a worker process dies abruptly, causing BrokenProcessPool, which aborts the entire test run with no useful diagnostics.

The root cause has not been reliably reproduced despite multiple attempts (local 32 GB machines with up to 32 workers, Horde Ubuntu 32 GB with 32 workers pushing swap to 0 bytes free). Previous fixes addressed specific crash vectors (#1988 OpenUSD threading, #2262 Open3D Poisson threading) but the issue keeps recurring.
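The failure mode itself is easy to reproduce synthetically. If a worker process dies for any reason (here via os._exit; in production, the OS killing it under memory pressure), the pool surfaces BrokenProcessPool to the caller:

```python
import concurrent.futures
import os
from concurrent.futures.process import BrokenProcessPool

def _die(_):
    os._exit(1)  # simulate a worker process being abruptly terminated

def pool_breaks():
    # Consuming results from a pool whose worker died raises
    # BrokenProcessPool, which is what aborts the test run.
    try:
        with concurrent.futures.ProcessPoolExecutor(max_workers=1) as ex:
            list(ex.map(_die, [0]))
    except BrokenProcessPool:
        return True
    return False
```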
Approach
Instead of trying to prevent the crash (which we can't reliably reproduce), survive it gracefully, ported from Warp's battle-tested implementation:

- Catch BrokenProcessPool (and TimeoutError) around the executor.map() call instead of letting it abort the run
- In CI (GITHUB_ACTIONS, GITLAB_CI, CI): exit immediately to respect job timeout constraints
- Outside CI: retry each suite in an isolated ProcessPoolExecutor(max_workers=1), with per-suite error reporting via create_crash_result()

Additional changes
- Add _get_suite_name() and create_crash_result() helpers
- Increase suite timeout from 2400s to 3600s (aligned with Warp)

Checklist

- CHANGELOG.md has been updated (if user-facing change)

Test plan
The fallback path activates only when BrokenProcessPool (or TimeoutError) occurs during parallel execution. Normal test runs are unaffected; verified locally on Windows 11 (32 GB, 8 and 16 workers) and Horde Ubuntu (32 GB, 32 workers).

Bug fix
Steps to reproduce:
The crash is intermittent and environment-dependent. Originally reported on Win11 Horde machines with 16 GB RAM and L40 GPU. Has also appeared on Linux CI with 64 GB RAM. Could not be reproduced locally despite aggressive attempts (32 workers, swap exhaustion).
🤖 Generated with Claude Code
Summary by CodeRabbit