feat(pytorch): instrument distributed training with pytorch.rank lifetime span#18449
feat(pytorch): instrument distributed training with pytorch.rank lifetime span#18449kr-igor wants to merge 15 commits into
Conversation
|
✨ Fix all issues with BitsAI or with Cursor
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3e8a50a2f2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Codeowners resolved as |
BenchmarksBenchmark execution time: 2026-06-11 22:17:23 Comparing candidate commit 03f2c01 in PR branch Found 0 performance improvements and 3 performance regressions! Performance is the same for 614 metrics, 10 unstable metrics. scenario:iastaspects-index_aspect
scenario:iastaspectsospath-ospathbasename_aspect
scenario:span-start
|
cb263bb to
00c7495
Compare
brettlangdon
left a comment
There was a problem hiding this comment.
initial feedback, I haven't had a chance to go through the pytorch integration + tests yet
4493daf to
a8e0346
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7304981840
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…time span Adds a pytorch integration that instruments distributed PyTorch training jobs by wrapping torch.distributed.init_process_group / destroy_process_group to open and close a pytorch.rank lifetime span per worker process. Key features: - pytorch.rank span carries rank, world_size, training_job.id (from RAY_JOB_ID, TORCHELASTIC_RUN_ID, KUBEFLOW_TRAINING_JOB_ID, SLURM_JOB_ID, or DD_PYTORCH_JOB_ID), device info (GPU UUID via pynvml / torch props), and framework tag (ddp / fsdp / deepspeed) - Cross-rank correlation via shared training_job.id tag - Ray Train integration: propagates _DD_RAY_RUN_METADATA JSON env var into ray.submission_id and ray.metadata.job_name span tags - Fork-safe: at-fork hooks reset bootstrap state; _installed preserved to match inherited wrapt wrappers so uninstall() works correctly in child processes - UUID CUDA_VISIBLE_DEVICES support: uses nvmlDeviceGetHandleByUUID when CUDA_VISIBLE_DEVICES contains GPU-/MIG- UUIDs (k8s NVIDIA device plugin) - skip_venv_artifacts + skip_pip_cache in suitespec to avoid torch venv upload timeouts in CI - _DD_RAY_RUN_METADATA type corrected to "json" in supported-configurations.json
5c02ace to
89cc563
Compare
…acer Adds dd_training_step_begin/end ctypes bindings in _c_tracer.py and wraps torch.optim.Optimizer.step (gated on DD_TRAINING_STEP_PROFILING=true) so the C tracer gets precise step boundaries for Layer 2 per-step spans. Also fires step_begin after distributed bootstrap and step_end before the rank-root span closes, bounding the first and last steps of training.
…tributeError PropertyMock on type(lib) does not override MagicMock's auto-attribute creation. Using spec= restricts the mock to only declared attributes, so accessing dd_training_step_begin/end raises AttributeError as intended.
The ctypes.CDLL mock approach was unreliable: accessing an attribute on a MagicMock used as return_value did not raise AttributeError as intended. Replace with direct module-state injection (same pattern as the existing swallows_exception tests): set _loaded=True, _lib=object(), and _step_begin/end_fn to None or a Mock. This tests the actual behavior of step_begin/end without depending on _load() internals.
- _c_tracer.py: add service tag to the context snapshot passed to C
tracer; reads ddtrace.config.service (i.e. DD_SERVICE) at call time
- _rank_root.py: pytorch.rank span now uses ddtrace.config.service when
set, falling back to ext_service("pytorch") — inherits Ray service name
- test_c_tracer.py: update tag count assertion from 4 to 5 and verify
the service key is present in the payload
…D_SERVICE ext_service doesn't check the global service; int_service does, and it already guards against using the inferred-base-service (auto-derived module name), so DD_SERVICE set explicitly by the user (e.g. a Ray job service) propagates correctly while the default 'pytorch' is preserved when DD_SERVICE is unset.
…onfig.service ddtrace.config.service can be empty or the inferred module name; the pytorch.rank span's actual service (set by int_service, which picks up DD_SERVICE correctly) is the authoritative value to pass through.
|
|
||
| def install() -> None: | ||
| global _installed | ||
| with _install_lock: |
There was a problem hiding this comment.
why do we need a lock and global installed marker?
isn't this managed/called by patch() so a single entrypoint that is GIL bound and already handles the "is installed" bit ?
There was a problem hiding this comment.
_installed guards the post-import hook callbacks for FSDP/DeepSpeed. Wrapt has no deregister API, so without it a late import torch.distributed.fsdp after unpatch() would re-wrap the constructor. Happy to simplify by reusing torch._datadog_patch directly in the callbacks instead of a separate flag, if you prefer it.
| .analysis/ | ||
|
|
||
| # file created when running scripts/lint | ||
| uv.lock No newline at end of file |
There was a problem hiding this comment.
we don't want to remove this .gitignore, can we add it back?
4e27e64 to
f6bde19
Compare
b9944a2 to
034d2f5
Compare
…straint - Re-add tests/contrib/pytorch/__init__.py (lost during squash) so that PatchTestCase subprocess re-runs can resolve the test module by name - Remove child_of=None from _build_span so pytorch.rank inherits the active context (e.g. nests under ray.train.worker when Ray is active) - Simplify supported version constraint from >=2.0,<3.0 to >=2.0 so the integration registry test (which uses Specifier, not SpecifierSet) can parse it; upper bound is enforced at runtime in patch() anyway
400fd44 to
06c6e1b
Compare
…arenting Two test fixes: - patch.py: restore `TORCH_VERSION >= (3, 0, 0)` guard that was dropped when simplifying _supported_versions() to `">=2.0"` — the registry string only needs the lower bound but the runtime check still enforces <3.0. - _rank_root._build_span: pass child_of=tracer.current_span() when framework is "ray" so pytorch.rank nests under the active ray.train.worker span; activate=False alone does not inherit the active context.
06c6e1b to
5ed0af7
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ed0af7767
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if not hasattr(torch.optim, "Optimizer"): | ||
| return | ||
| try: | ||
| _wrap("torch.optim", "Optimizer.step", _wrapped_optimizer_step) |
There was a problem hiding this comment.
Wrap concrete optimizer step implementations
With DD_TRAINING_STEP_PROFILING enabled, this wraps only torch.optim.Optimizer.step, but common PyTorch optimizers such as SGD/Adam/AdamW define their own step methods, so Python method resolution calls the subclass implementation and never reaches this wrapper. In those normal training loops, the C tracer only gets the initial step_begin() from bootstrap and never receives per-step step_end()/step_begin() signals, making the opt-in step profiling ineffective for standard optimizers.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This is fine in the initial version. Will update in a followup PR
| _cached_distributed_backend = None | ||
| _fsdp_hook_registered = False | ||
| _deepspeed_hook_registered = False | ||
| _optimizer_wrapped = False |
There was a problem hiding this comment.
Preserve optimizer wrapper state after fork
In a forked child with DD_TRAINING_STEP_PROFILING=true, the Optimizer.step wrapt wrapper installed in the parent is inherited, but this reset marks _optimizer_wrapped as False. If the child then calls ddtrace.unpatch(pytorch=True), _uninstall_optimizer_step() returns early on that flag and leaves the inherited optimizer wrapper active after unpatch; fresh evidence in this version is that _installed is no longer cleared after fork, but the optimizer-specific installed flag still is.
Useful? React with 👍 / 👎.
Summary
torch.distributed.init_process_group/destroy_process_groupto emit a singlepytorch.rankspan per rank for the lifetime of the process group.rank,world_size,framework(DDP / FSDP / DeepSpeed),launcher,torch.distributed.backend,training_job_id(auto-resolved fromRAY_JOB_ID/TORCHELASTIC_RUN_ID/KUBEFLOW_TRAINING_JOB_ID/SLURM_JOB_ID).ray.train.run_name,ray.submission_id,ray.metadata.*._dd.was_long_running=1.dd-trace-c) is present via LD_PRELOAD, all GPU-level root spans it creates automatically become children ofpytorch.rankviadd_set_global_parent_context/dd_clear_global_parent_context.DD_PATCH_MODULES=pytorch:true.DD_TRACE_PYTORCH_ENABLEDandDD_PYTORCH_JOB_IDare registered in the configuration registry.Validation
scripts/run-tests --venv 116b0b8)python scripts/supported_configurations.py --check)pr_name_lintpassesast-greprules pass (usesenv.get()notos.environ.get())pytorch.rankspan appears in Datadog UI when running a distributed training job withDD_PATCH_MODULES=pytorch:trueFiles
ddtrace/contrib/internal/pytorch/ddtrace/_monkey.pypytorch: False(opt-in only)ddtrace/internal/settings/_config.pypytorchinINTEGRATION_CONFIGSsupported-configurations.json+_supported_configurations.pyDD_TRACE_PYTORCH_ENABLED,DD_PYTORCH_JOB_ID,DD_PYTORCH_SERVICEriotfile.py+tests/contrib/suitespec.yml.riot/requirements/tests/contrib/pytorch/releasenotes/notes/pytorch-rank-span-*.yaml