fix(isaaclab_physx): only unsubscribe _on_stop to preserve warm_start callback#5371
fix(isaaclab_physx): only unsubscribe _on_stop to preserve warm_start callback#5371preeves-nv wants to merge 23 commits intoisaac-sim:developfrom
Conversation
…broken links Combined fixes from PR isaac-sim#5279 on top of deprecation migration from PR isaac-sim#5280: 1. install.py: Add prebundle probe functions that check pip_prebundle paths instead of extsDeprecated, and uninstall pip torch when prebundle shadows it 2. isaaclab/__init__.py: Deprioritize ml_archive/pip_prebundle on sys.path so pip-installed torch/numpy/etc take priority over prebundled copies 3. setup.py: Pin mujoco==3.5.0 4. Kit files: Remove omni.warp.core to prevent conflict with pip warp-lang 5. check-links.yml: Add ubuntu.com to link checker exclusions (returns 403) 6. New tests: test_install_commands.py, test_install_prebundle.py
- Add isaacsim.pip.newton and omni.warp.core to app.extensions.excluded in all 6 kit files to prevent Kit from loading these extensions - Add isaacsim.pip.newton to _CONFLICTING_EXTS in __init__.py so its pip_prebundle paths are deprioritized on sys.path - isaacsim.pip.newton prebundles newton 1.0.0 which shadows the pip-installed newton 1.1.0.dev0 needed by Isaac Lab - omni.warp.core loads a bundled warp that conflicts with pip-installed warp-lang used by Isaac Lab
Remove the pinned SHA digest from ISAACSIM_BASE_VERSION so CI tests against the most recent Isaac Sim nightly image, which includes the pip_prebundle path changes that this PR fixes.
The previous approach filtered only omni.isaac.ml_archive and isaacsim.pip.newton prebundle paths. On newer Isaac Sim nightly images, additional or renamed extensions may also inject conflicting packages via pip_prebundle. Match on 'pip_prebundle' in the path alone to catch all current and future prebundle directories.
The renderer's render() method launches warp kernels that populate the camera output tensors on the GPU. Without an explicit synchronization barrier, downstream torch operations (e.g. torch.mean on camera data) can execute on a different CUDA stream before the warp writes complete, resulting in 'CUDA error: an illegal memory access was encountered'. This was exposed by newer Isaac Sim nightly images that changed warp's default CUDA stream behavior. Add wp.synchronize() in Camera's _update_buffers_impl after the render pass to guarantee kernel completion before any torch access.
The render() method launches a warp reshape kernel via wp.launch() then immediately performs torch operations (depth clipping, inf assignment) on the same output buffer. These run on different CUDA streams — warp's stream vs torch's default stream — with no synchronization, causing illegal memory access errors that corrupt the CUDA context. Add wp.synchronize() after wp.launch() and before any torch post- processing within the per-data-type loop. This complements the existing wp.synchronize() in Camera._update_buffers_impl() which guards the render→read_output boundary.
Extend _deprioritize_prebundle_paths() to also demote known conflicting extension directories (omni.warp.core, omni.isaac.ml_archive, etc.) in addition to pip_prebundle paths. This prevents kit extensions that bundle their own Python packages from shadowing pip-installed versions. Also re-run the sanitization after SimulationApp + extensions load, since Kit may insert new paths onto sys.path during startup that were not present when isaaclab/__init__.py first ran.
The nightly Isaac Sim Docker (>= 04/14) ships warp 1.13.x inside kit/python/lib/python3.12/site-packages. setup_python_env.sh places that directory on PYTHONPATH ahead of pip's site-packages, so warp 1.13.x was imported despite the _pin_warp_import pre-import added in 4.6.8 — that pre-import was locking the kit warp, not the pip warp. Warp 1.13.x deprecates warp.types.array with changed semantics. omni.replicator.core accesses this symbol during rendering, triggering CUDA error 700 (illegal memory access). That error poisons the CUDA context: all subsequent warp kernel lookups return NULL, producing the "Failed to find forward kernel 'update_outdated_envs_kernel'" crash. Fix: extend _should_demote() to catch kit/python/lib/*/site-packages paths so that pip-installed warp-lang always wins. Also add explicit strip_hash=False to isaaclab.sensors.kernels as a defensive safety net, and broaden the AppLauncher post-startup diagnostic to cover kit paths.
The 4.6.9 path-demotion fix moved kit/python/lib/python3.12/site-packages to the end of sys.path, but warp-lang was not a pip requirement. With no pip-managed warp installed, the demoted kit warp 1.13.x was still the only warp available and was still imported. Root fix: add warp-lang>=1.0.0,<1.13 to pip requirements. When ./isaaclab.sh --install runs, pip uses kit's own Python (kit/python/bin/python3) which installs into kit/python/lib/python3.12/ site-packages — the same directory kit shipped warp 1.13.x — overwriting it with a compatible version. Also replace path-based warnings in _pin_warp_import and AppLauncher with version-based checks (>= 1.13). Path-based checks would fire false positives after pip installs warp-lang into kit's site-packages (a kit-managed path, but pip-managed content). Version-based checks correctly identify the actual incompatibility regardless of install location.
…Lab-public into kellyg/fix-installation
Made-with: Cursor
…Lab-public into kellyg/fix-installation
… callback Calling enable_all_default_callbacks(False) disables the warm_start/PLAY callback in addition to _on_stop. The warm_start callback initialises the rendering pipeline on sim.reset(); without it, tiled-camera and visualization- marker tests produce black frames and wrong outputs. Replace the blanket disable with a targeted unsubscribe of _default_callback_on_stop only — that is the sole callback that calls invalidate_physics() and wrecks the shared omni.physics.tensors view. All other callbacks (warm_start, stage_open, stage_close) are left intact. Fixes the following CI failures in PR isaac-sim#5358: - isaaclab_visualizers - isaaclab (core) [1/3] — test_tiled_camera, test_multi_tiled_camera, test_first_frame_textured_rendering, test_visualization_markers, test_ray_caster_camera Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
38c54dc to
3799153
Compare
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR fixes a critical runtime error (Simulation view object is invalidated) that occurs after sim.reset() with recent Isaac Sim builds. The root cause is that Isaac Sim's SimulationManager._on_stop callback now correctly fires (after being fixed upstream with @staticmethod) and calls invalidate_physics(), which wrecks the shared omni.physics.tensors view that PhysxManager relies on. The fix selectively unsubscribes only the _on_stop callback while preserving the warm_start callback needed for rendering initialization.
Architecture Impact
High cross-module impact. This change touches the simulation lifecycle at the deepest level:
isaaclab_physx/__init__.py: Patches Isaac Sim's SimulationManager at import timeisaaclab/__init__.py: New_deprioritize_prebundle_paths()affects all imports system-wideisaaclab/app/app_launcher.py: Re-runs path sanitization after Kit startupisaaclab/cli/commands/install.py: Major additions for prebundle torch detection, pinocchio installation, and nvidia namespace handling- All
.kitfiles: Excludes additional conflicting extensions - All PhysX asset/sensor files: Import path migration from
omni.physics.tensors.impl.apitoomni.physics.tensors.api
The changes are defensive and should be backward-compatible, but the monkey-patching in isaaclab_physx/__init__.py is inherently fragile if Isaac Sim's internal API changes.
Implementation Verdict
Minor fixes needed — The core fix is correct and well-reasoned, but there are a few concerns.
Test Coverage
Excellent. The PR adds 867 lines of new tests in test_install_commands.py and test_install_prebundle.py covering:
- Torch prebundle detection across environment types (uv, pip, conda, kit)
- Architecture-specific CUDA torch installation (x86 cu128 vs ARM cu130)
- Prebundle repointing with nvidia namespace special handling
- Edge cases: stale symlinks, missing backups, OSError recovery, Windows copy fallback
The tests are well-structured with parametrization and proper mocking. However, there's no direct integration test for the core SimulationManager patching fix itself — that relies on CI passing the previously-failing tests.
CI Status
Only labeler has completed (success). The actual test jobs haven't run yet, which are critical for validating this fix.
Findings
🔴 Critical: source/isaaclab_physx/isaaclab_physx/init.py:61-70 — Race condition in callback teardown
if hasattr(original_class, "_default_callback_on_stop"):
original_class._default_callback_on_stop = None
else:
try:
original_class.enable_all_default_callbacks(False)
except Exception:
passThe fallback to enable_all_default_callbacks(False) contradicts the PR's stated goal of preserving warm_start. If _default_callback_on_stop doesn't exist (API change), this silently disables all callbacks including the rendering pipeline. The comment acknowledges this but a silent pass on exception means broken rendering with no diagnostic. Should at minimum log a warning.
🟡 Warning: source/isaaclab_physx/isaaclab_physx/init.py:48-52 — Force-import may have side effects
try:
import isaacsim.core.simulation_manager # noqa: F401
except ImportError:
returnForce-importing the module triggers its extension startup code which registers callbacks. This is intentional but creates an ordering dependency — if this import happens before Kit is fully initialized, it could fail or behave unexpectedly. The except ImportError is appropriate but doesn't catch RuntimeError or other Kit initialization errors.
🟡 Warning: source/isaaclab/isaaclab/init.py:52-54 — PYTHONPATH rewriting affects subprocesses
os.environ["PYTHONPATH"] = os.pathsep.join(env_clean + env_demoted)This modifies the environment variable permanently for the process and all subprocesses. While necessary for consistency, this could cause unexpected behavior if a subprocess legitimately needs the original ordering. Consider documenting this side effect.
🟡 Warning: source/isaaclab/isaaclab/cli/commands/install.py:513-515 — nvidia cudnn check is brittle
if pkg_name == "nvidia" and not (venv_pkg / "cudnn").exists():
print_debug(f"Skipping repoint of {prebundled}: {venv_pkg} lacks CUDA subpackages (cudnn missing).")
continueThe cudnn subdirectory check assumes the torch CUDA wheel always creates this structure. If NVIDIA changes the package layout (e.g., nvidia_cudnn_cu12 vs nvidia/cudnn), this heuristic will fail silently. The test coverage is good but testing against the real package structure would be more robust.
🔵 Improvement: source/isaaclab_physx/isaaclab_physx/renderers/isaac_rtx_renderer.py:296-307 — Early return loses frame data silently
if tiled_data_buffer.size == 0:
continueWhen the annotator returns an empty buffer, the code now silently skips that data type. This is correct for preventing the CUDA crash, but downstream code may expect valid data. Consider logging at debug level when this happens so users can understand why the first frame(s) may have zero values.
🔵 Improvement: source/isaaclab/isaaclab/cli/commands/install.py:172-176 — Warning message could be actionable
print_warning(
"Force-installing the cmeel pinocchio stack failed (returncode "
f"{install_result.returncode}). The pink IK controller and its tests will not be"
" usable until ``pin pin-pink==3.1.0 daqp==0.7.2`` is installed manually."
)Good that it doesn't abort, but the warning should include the actual error output (install_result.stderr) to help users diagnose the failure.
🔵 Improvement: apps/*.kit files — Duplicated exclusion list
All .kit files now have identical app.extensions.excluded lists:
app.extensions.excluded = ["omni.kit.pip_archive", "omni.isaac.ml_archive", "isaacsim.pip.newton", "omni.warp.core"]
Consider extracting this to a shared include file to avoid drift. The isaaclab.python.headless.kit line 109 and others repeat this pattern.
Signed-off-by: Kelly Guo <kellyg@nvidia.com>
Signed-off-by: Kelly Guo <kellyg@nvidia.com>
Summary
Companion fix for #5358. Calling
enable_all_default_callbacks(False)disables thewarm_start(PLAY) callback in addition to_on_stop, breaking the rendering pipeline onsim.reset()— tiled-camera RGB output stays black and visualization markers fail.warm_startis the PLAY-event callback in Isaac Sim'sSimulationManagerthat initialises the rendering pipeline when simulation starts. Without it firing onsim.reset(), tiled cameras and visualization markers have no renderer to draw into.Replace the blanket disable with a targeted unsubscribe of
_default_callback_on_stoponly — the sole callback that callsinvalidate_physics()and wrecks the sharedomni.physics.tensorsview.Expected CI impact
The following failures seen in #5358 are consistent with the
warm_startcallback being disabled:isaaclab (core) [1/3]—test_first_frame_textured_rendering(0/1),test_tiled_camera(6/8),test_multi_tiled_camera(6/11),test_visualization_markers(5/6)isaaclab_visualizersisaaclab_physxThe following may also be resolved but were not directly inspected:
isaaclab_tasks [1/3],[2/3],[3/3]environments_trainingisaaclab (core) [3/3]Test plan
isaaclab_visualizersisaaclab (core) [1/3]camera testsisaaclab_physx🤖 Generated with Claude Code