9 changes: 7 additions & 2 deletions nemo_rl/distributed/worker_groups.py
@@ -497,8 +497,13 @@ def _create_workers_from_bundle_indices(
"AVAILABLE_PORT_LIST": str(available_ports),
}
)
# Remove Ray-specific environment variables, let the worker itself set them.
worker_env_vars.pop("RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES", None)
# Preserve RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 to prevent Ray
# from masking CUDA_VISIBLE_DEVICES per actor. GPU masking triggers NCCL
# bugs on NVSwitch topologies (H200/P5en, H100/P5) including cuMem import
# penalty (nccl#1749) and NVLS rank ordering corruption (nccl#1906).
# Workers use explicit torch.cuda.set_device(local_rank) instead.
Collaborator
This comment is inaccurate — no production nemo_rl code calls torch.cuda.set_device(local_rank). Device binding happens implicitly via init_device_mesh / Megatron internals reading the LOCAL_RANK env var.

Suggested change:
- # Workers use explicit torch.cuda.set_device(local_rank) instead.
+ # Workers rely on LOCAL_RANK env var for device selection via
+ # init_device_mesh / Megatron internals.

+ # See: https://github.com/NVIDIA-NeMo/RL/issues/1963
+ worker_env_vars["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"
Collaborator
Critical: DTensor V1 LOCAL_RANK=0 regression

This unconditional = "1" makes all GPUs visible to every worker. However, dtensor_policy_worker.py:320-329 has a LOCAL_RANK=0 hack that assumes only 1 GPU is visible:

# torch==2.8 uses LOCAL_RANK to set the device here
# but CUDA_VISIBLE_DEVICES is set to only 1 gpu, so we need to temporarily set LOCAL_RANK to 0.
prev_local_rank = os.environ["LOCAL_RANK"]
os.environ["LOCAL_RANK"] = "0"
device_mesh = torch.distributed.device_mesh.init_device_mesh(...)

With all GPUs visible, init_device_mesh calls set_device(0) for every worker, so all workers contend for GPU 0 and OOM. DTensor V1 is the default (lm_policy.py:113: _v2=False). DTensor V2 is unaffected (it uses FSDP2Manager, which reads the real LOCAL_RANK).

Fix: Remove the LOCAL_RANK=0 hack (lines 323-324). With all GPUs visible, init_device_mesh should use the real LOCAL_RANK=bundle_idx.
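The regression can be illustrated with a torch-free sketch. Here `select_device` is a hypothetical stand-in for the LOCAL_RANK-based device choice inside `init_device_mesh` (assumed behavior, not nemo_rl code):

```python
import os

def select_device(visible_gpus: int) -> int:
    """Mimic the device choice: torch>=2.8 reads LOCAL_RANK (hypothetical)."""
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    assert rank < visible_gpus, "LOCAL_RANK must index a visible GPU"
    return rank

# Old setup: Ray masked CUDA_VISIBLE_DEVICES to one GPU per worker, so the
# hack pinned LOCAL_RANK=0 and the sole visible GPU was always index 0.
os.environ["LOCAL_RANK"] = "0"
assert select_device(visible_gpus=1) == 0

# With the hack left in place but all 8 GPUs visible, every worker would
# still compute device 0 and collide there.
assert select_device(visible_gpus=8) == 0

# Proposed fix: with all GPUs visible, use the real LOCAL_RANK (bundle_idx).
os.environ["LOCAL_RANK"] = "3"
assert select_device(visible_gpus=8) == 3
```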

worker_env_vars.pop("RAY_CLIENT_MODE", None)
worker_env_vars.pop("RAY_JOB_ID", None)
worker_env_vars.pop("RAY_LD_PRELOAD", None)