Skip to content

fix(isaaclab_physx): only unsubscribe _on_stop to preserve warm_start callback#5371

Open
preeves-nv wants to merge 23 commits intoisaac-sim:developfrom
preeves-nv:preeves/fix-on-stop-targeted
Open

fix(isaaclab_physx): only unsubscribe _on_stop to preserve warm_start callback#5371
preeves-nv wants to merge 23 commits intoisaac-sim:developfrom
preeves-nv:preeves/fix-on-stop-targeted

Conversation

@preeves-nv
Copy link
Copy Markdown
Collaborator

@preeves-nv preeves-nv commented Apr 23, 2026

Summary

Companion fix for #5358. Calling enable_all_default_callbacks(False) disables the warm_start (PLAY) callback in addition to _on_stop, breaking the rendering pipeline on sim.reset() — tiled-camera RGB output stays black and visualization markers fail.

warm_start is the PLAY-event callback in Isaac Sim's SimulationManager that initialises the rendering pipeline when simulation starts. Without it firing on sim.reset(), tiled cameras and visualization markers have no renderer to draw into.

Replace the blanket disable with a targeted unsubscribe of _default_callback_on_stop only — the sole callback that calls invalidate_physics() and wrecks the shared omni.physics.tensors view.

Expected CI impact

The following failures seen in #5358 are consistent with the warm_start callback being disabled:

  • isaaclab (core) [1/3]test_first_frame_textured_rendering (0/1), test_tiled_camera (6/8), test_multi_tiled_camera (6/11), test_visualization_markers (5/6)
  • isaaclab_visualizers
  • isaaclab_physx

The following may also be resolved but were not directly inspected:

  • isaaclab_tasks [1/3], [2/3], [3/3]
  • environments_training
  • isaaclab (core) [3/3]

Test plan

  • CI passes isaaclab_visualizers
  • CI passes isaaclab (core) [1/3] camera tests
  • CI passes isaaclab_physx

🤖 Generated with Claude Code

Kelly Guo and others added 21 commits April 22, 2026 11:16
…broken links

Combined fixes from PR isaac-sim#5279 on top of deprecation migration from PR isaac-sim#5280:

1. install.py: Add prebundle probe functions that check pip_prebundle paths
   instead of extsDeprecated, and uninstall pip torch when prebundle shadows it
2. isaaclab/__init__.py: Deprioritize ml_archive/pip_prebundle on sys.path
   so pip-installed torch/numpy/etc take priority over prebundled copies
3. setup.py: Pin mujoco==3.5.0
4. Kit files: Remove omni.warp.core to prevent conflict with pip warp-lang
5. check-links.yml: Add ubuntu.com to link checker exclusions (returns 403)
6. New tests: test_install_commands.py, test_install_prebundle.py
- Add isaacsim.pip.newton and omni.warp.core to app.extensions.excluded
  in all 6 kit files to prevent Kit from loading these extensions
- Add isaacsim.pip.newton to _CONFLICTING_EXTS in __init__.py so its
  pip_prebundle paths are deprioritized on sys.path
- isaacsim.pip.newton prebundles newton 1.0.0 which shadows the
  pip-installed newton 1.1.0.dev0 needed by Isaac Lab
- omni.warp.core loads a bundled warp that conflicts with pip-installed
  warp-lang used by Isaac Lab
Remove the pinned SHA digest from ISAACSIM_BASE_VERSION so CI tests
against the most recent Isaac Sim nightly image, which includes the
pip_prebundle path changes that this PR fixes.
The previous approach filtered only omni.isaac.ml_archive and
isaacsim.pip.newton prebundle paths. On newer Isaac Sim nightly
images, additional or renamed extensions may also inject conflicting
packages via pip_prebundle. Match on 'pip_prebundle' in the path
alone to catch all current and future prebundle directories.
The renderer's render() method launches warp kernels that populate the
camera output tensors on the GPU. Without an explicit synchronization
barrier, downstream torch operations (e.g. torch.mean on camera data)
can execute on a different CUDA stream before the warp writes complete,
resulting in 'CUDA error: an illegal memory access was encountered'.

This was exposed by newer Isaac Sim nightly images that changed warp's
default CUDA stream behavior. Add wp.synchronize() in Camera's
_update_buffers_impl after the render pass to guarantee kernel
completion before any torch access.
The render() method launches a warp reshape kernel via wp.launch() then
immediately performs torch operations (depth clipping, inf assignment)
on the same output buffer. These run on different CUDA streams — warp's
stream vs torch's default stream — with no synchronization, causing
illegal memory access errors that corrupt the CUDA context.

Add wp.synchronize() after wp.launch() and before any torch post-
processing within the per-data-type loop. This complements the existing
wp.synchronize() in Camera._update_buffers_impl() which guards the
render→read_output boundary.
Extend _deprioritize_prebundle_paths() to also demote known conflicting
extension directories (omni.warp.core, omni.isaac.ml_archive, etc.)
in addition to pip_prebundle paths.  This prevents kit extensions that
bundle their own Python packages from shadowing pip-installed versions.

Also re-run the sanitization after SimulationApp + extensions load,
since Kit may insert new paths onto sys.path during startup that were
not present when isaaclab/__init__.py first ran.
The nightly Isaac Sim Docker (>= 04/14) ships warp 1.13.x inside
kit/python/lib/python3.12/site-packages. setup_python_env.sh places
that directory on PYTHONPATH ahead of pip's site-packages, so warp
1.13.x was imported despite the _pin_warp_import pre-import added in
4.6.8 — that pre-import was locking the kit warp, not the pip warp.

Warp 1.13.x deprecates warp.types.array with changed semantics.
omni.replicator.core accesses this symbol during rendering, triggering
CUDA error 700 (illegal memory access). That error poisons the CUDA
context: all subsequent warp kernel lookups return NULL, producing the
"Failed to find forward kernel 'update_outdated_envs_kernel'" crash.

Fix: extend _should_demote() to catch kit/python/lib/*/site-packages
paths so that pip-installed warp-lang always wins. Also add explicit
strip_hash=False to isaaclab.sensors.kernels as a defensive safety net,
and broaden the AppLauncher post-startup diagnostic to cover kit paths.
The 4.6.9 path-demotion fix moved kit/python/lib/python3.12/site-packages
to the end of sys.path, but warp-lang was not a pip requirement. With no
pip-managed warp installed, the demoted kit warp 1.13.x was still the only
warp available and was still imported.

Root fix: add warp-lang>=1.0.0,<1.13 to pip requirements. When
./isaaclab.sh --install runs, pip uses kit's own Python
(kit/python/bin/python3) which installs into kit/python/lib/python3.12/
site-packages — the same directory kit shipped warp 1.13.x — overwriting
it with a compatible version.

Also replace path-based warnings in _pin_warp_import and AppLauncher with
version-based checks (>= 1.13). Path-based checks would fire false
positives after pip installs warp-lang into kit's site-packages (a
kit-managed path, but pip-managed content). Version-based checks correctly
identify the actual incompatibility regardless of install location.
… callback

Calling enable_all_default_callbacks(False) disables the warm_start/PLAY
callback in addition to _on_stop. The warm_start callback initialises the
rendering pipeline on sim.reset(); without it, tiled-camera and visualization-
marker tests produce black frames and wrong outputs.

Replace the blanket disable with a targeted unsubscribe of _default_callback_on_stop
only — that is the sole callback that calls invalidate_physics() and wrecks the
shared omni.physics.tensors view. All other callbacks (warm_start, stage_open,
stage_close) are left intact.

Fixes the following CI failures in PR isaac-sim#5358:
- isaaclab_visualizers
- isaaclab (core) [1/3] — test_tiled_camera, test_multi_tiled_camera,
  test_first_frame_textured_rendering, test_visualization_markers,
  test_ray_caster_camera

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added bug Something isn't working documentation Improvements or additions to documentation asset New asset feature or request isaac-sim Related to Isaac Sim team isaac-mimic Related to Isaac Mimic team infrastructure labels Apr 23, 2026
@preeves-nv preeves-nv marked this pull request as ready for review April 23, 2026 12:06
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR fixes a critical runtime error (Simulation view object is invalidated) that occurs after sim.reset() with recent Isaac Sim builds. The root cause is that Isaac Sim's SimulationManager._on_stop callback now correctly fires (after being fixed upstream with @staticmethod) and calls invalidate_physics(), which wrecks the shared omni.physics.tensors view that PhysxManager relies on. The fix selectively unsubscribes only the _on_stop callback while preserving the warm_start callback needed for rendering initialization.

Architecture Impact

High cross-module impact. This change touches the simulation lifecycle at the deepest level:

  • isaaclab_physx/__init__.py: Patches Isaac Sim's SimulationManager at import time
  • isaaclab/__init__.py: New _deprioritize_prebundle_paths() affects all imports system-wide
  • isaaclab/app/app_launcher.py: Re-runs path sanitization after Kit startup
  • isaaclab/cli/commands/install.py: Major additions for prebundle torch detection, pinocchio installation, and nvidia namespace handling
  • All .kit files: Excludes additional conflicting extensions
  • All PhysX asset/sensor files: Import path migration from omni.physics.tensors.impl.api to omni.physics.tensors.api

The changes are defensive and should be backward-compatible, but the monkey-patching in isaaclab_physx/__init__.py is inherently fragile if Isaac Sim's internal API changes.

Implementation Verdict

Minor fixes needed — The core fix is correct and well-reasoned, but there are a few concerns.

Test Coverage

Excellent. The PR adds 867 lines of new tests in test_install_commands.py and test_install_prebundle.py covering:

  • Torch prebundle detection across environment types (uv, pip, conda, kit)
  • Architecture-specific CUDA torch installation (x86 cu128 vs ARM cu130)
  • Prebundle repointing with nvidia namespace special handling
  • Edge cases: stale symlinks, missing backups, OSError recovery, Windows copy fallback

The tests are well-structured with parametrization and proper mocking. However, there's no direct integration test for the core SimulationManager patching fix itself — that relies on CI passing the previously-failing tests.

CI Status

Only labeler has completed (success). The actual test jobs haven't run yet, which are critical for validating this fix.

Findings

🔴 Critical: source/isaaclab_physx/isaaclab_physx/init.py:61-70 — Race condition in callback teardown

if hasattr(original_class, "_default_callback_on_stop"):
    original_class._default_callback_on_stop = None
else:
    try:
        original_class.enable_all_default_callbacks(False)
    except Exception:
        pass

The fallback to enable_all_default_callbacks(False) contradicts the PR's stated goal of preserving warm_start. If _default_callback_on_stop doesn't exist (API change), this silently disables all callbacks including the rendering pipeline. The comment acknowledges this but a silent pass on exception means broken rendering with no diagnostic. Should at minimum log a warning.

🟡 Warning: source/isaaclab_physx/isaaclab_physx/init.py:48-52 — Force-import may have side effects

try:
    import isaacsim.core.simulation_manager  # noqa: F401
except ImportError:
    return

Force-importing the module triggers its extension startup code which registers callbacks. This is intentional but creates an ordering dependency — if this import happens before Kit is fully initialized, it could fail or behave unexpectedly. The except ImportError is appropriate but doesn't catch RuntimeError or other Kit initialization errors.

🟡 Warning: source/isaaclab/isaaclab/init.py:52-54 — PYTHONPATH rewriting affects subprocesses

os.environ["PYTHONPATH"] = os.pathsep.join(env_clean + env_demoted)

This modifies the environment variable permanently for the process and all subprocesses. While necessary for consistency, this could cause unexpected behavior if a subprocess legitimately needs the original ordering. Consider documenting this side effect.

🟡 Warning: source/isaaclab/isaaclab/cli/commands/install.py:513-515 — nvidia cudnn check is brittle

if pkg_name == "nvidia" and not (venv_pkg / "cudnn").exists():
    print_debug(f"Skipping repoint of {prebundled}: {venv_pkg} lacks CUDA subpackages (cudnn missing).")
    continue

The cudnn subdirectory check assumes the torch CUDA wheel always creates this structure. If NVIDIA changes the package layout (e.g., nvidia_cudnn_cu12 vs nvidia/cudnn), this heuristic will fail silently. The test coverage is good but testing against the real package structure would be more robust.

🔵 Improvement: source/isaaclab_physx/isaaclab_physx/renderers/isaac_rtx_renderer.py:296-307 — Early return loses frame data silently

if tiled_data_buffer.size == 0:
    continue

When the annotator returns an empty buffer, the code now silently skips that data type. This is correct for preventing the CUDA crash, but downstream code may expect valid data. Consider logging at debug level when this happens so users can understand why the first frame(s) may have zero values.

🔵 Improvement: source/isaaclab/isaaclab/cli/commands/install.py:172-176 — Warning message could be actionable

print_warning(
    "Force-installing the cmeel pinocchio stack failed (returncode "
    f"{install_result.returncode}). The pink IK controller and its tests will not be"
    " usable until ``pin pin-pink==3.1.0 daqp==0.7.2`` is installed manually."
)

Good that it doesn't abort, but the warning should include the actual error output (install_result.stderr) to help users diagnose the failure.

🔵 Improvement: apps/*.kit files — Duplicated exclusion list
All .kit files now have identical app.extensions.excluded lists:

app.extensions.excluded = ["omni.kit.pip_archive", "omni.isaac.ml_archive", "isaacsim.pip.newton", "omni.warp.core"]

Consider extracting this to a shared include file to avoid drift. The isaaclab.python.headless.kit line 109 and others repeat this pattern.

Signed-off-by: Kelly Guo <kellyg@nvidia.com>
Signed-off-by: Kelly Guo <kellyg@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

asset New asset feature or request bug Something isn't working documentation Improvements or additions to documentation infrastructure isaac-lab Related to Isaac Lab team isaac-mimic Related to Isaac Mimic team isaac-sim Related to Isaac Sim team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants