feat(pytorch): instrument distributed training with pytorch.rank lifetime span by kr-igor · Pull Request #18449 · DataDog/dd-trace-py

kr-igor · 2026-06-03T18:02:32Z

Summary

Patches torch.distributed.init_process_group / destroy_process_group to emit a single pytorch.rank span per rank for the lifetime of the process group.
Tags: rank, world_size, framework (DDP / FSDP / DeepSpeed), launcher, torch.distributed.backend, training_job_id (auto-resolved from RAY_JOB_ID / TORCHELASTIC_RUN_ID / KUBEFLOW_TRAINING_JOB_ID / SLURM_JOB_ID).
When running under Ray Train: ray.train.run_name, ray.submission_id, ray.metadata.*.
Span is rotated every 10 minutes: current span is finished and flushed, a fresh continuation opens. Rotated spans carry _dd.was_long_running=1.
When the Datadog C tracer (dd-trace-c) is present via LD_PRELOAD, all GPU-level root spans it creates automatically become children of pytorch.rank via dd_set_global_parent_context / dd_clear_global_parent_context.
No new configuration required. Enable with DD_PATCH_MODULES=pytorch:true. DD_TRACE_PYTORCH_ENABLED and DD_PYTORCH_JOB_ID are registered in the configuration registry.

Validation

99 unit tests pass locally (scripts/run-tests --venv 116b0b8)
Configuration Registry check passes (python scripts/supported_configurations.py --check)
pr_name_lint passes
ast-grep rules pass (uses env.get() not os.environ.get())
CI runs full torch 2.0–2.3 matrix (Python 3.9–3.12) — in progress
Verify pytorch.rank span appears in Datadog UI when running a distributed training job with DD_PATCH_MODULES=pytorch:true

Files

Path	What
`ddtrace/contrib/internal/pytorch/`	New integration (7 source files + patch entry)
`ddtrace/_monkey.py`	`pytorch: False` (opt-in only)
`ddtrace/internal/settings/_config.py`	`pytorch` in `INTEGRATION_CONFIGS`
`supported-configurations.json` + `_supported_configurations.py`	`DD_TRACE_PYTORCH_ENABLED`, `DD_PYTORCH_JOB_ID`, `DD_PYTORCH_SERVICE`
`riotfile.py` + `tests/contrib/suitespec.yml`	pytorch venv (torch 2.0–2.3, Python 3.9–3.12)
`.riot/requirements/`	Lockfiles for all pytorch venv combinations
`tests/contrib/pytorch/`	99 tests
`releasenotes/notes/pytorch-rank-span-*.yaml`	Release note

datadog-prod-us1-6 · 2026-06-03T18:03:19Z

Tests

✨ Fix all issues with BitsAI or with Cursor

⚠️ Warnings

🚦 8 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741238-d2b8243-manylinux2014_x86_64, 1]

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741491-d2b8243-musllinux_1_2_x86_64, 1]

DataDog/apm-reliability/dd-trace-py | build linux serverless: [arm64, cp315-cp315, v113741357-d2b8243-manylinux2014_aarch64, 1]

View all 8 failed jobs.

❄️ 1 New flaky test detected

test_rotation_tags_old_span_was_long_running[py3.11] from test_long_running_span.py

(Fix with Cursor)

assert None == 1
 &#43;  where None = &lt;built-in method _get_numeric_attribute of Span object at 0x7b63d85060c0&gt;(&#39;_dd.was_long_running&#39;)
 &#43;    where &lt;built-in method _get_numeric_attribute of Span object at 0x7b63d85060c0&gt; = Span(name=&#39;pytorch.rank&#39;, span_id=17071592593779704320, parent_id=None, trace_id=1411216454804671177739239564532773709..._priority_v1&#39;: 2}, _span_links=[], _baggage={}, _is_remote=False), service_entry_span_name=pytorch.rank), metastruct={}.get_metric

New test introduced in this PR is flaky.

View in Flaky Test Management

ℹ️ Info

No other issues found (see more)

🧪 All tests passed

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 03f2c01 | Docs | Datadog PR Page | Give us feedback!}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3e8a50a2f2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

cit-pr-commenter-54b7da · 2026-06-03T18:07:37Z

Codeowners resolved as

ddtrace/contrib/internal/pytorch/_distributed.py                        @DataDog/apm-core-python @DataDog/apm-idm-python

pr-commenter · 2026-06-03T18:39:25Z

Benchmarks

Benchmark execution time: 2026-06-11 22:17:23

Comparing candidate commit 03f2c01 in PR branch kr-igor/pytorch-rank-span with baseline commit 4ce48d8 in branch main.

Found 0 performance improvements and 3 performance regressions! Performance is the same for 614 metrics, 10 unstable metrics.

scenario:iastaspects-index_aspect

🟥 execution_time [+15.793µs; +18.000µs] or [+12.760%; +14.543%]

scenario:iastaspectsospath-ospathbasename_aspect

🟥 execution_time [+94.211µs; +101.960µs] or [+21.345%; +23.100%]

scenario:span-start

🟥 execution_time [+1.446ms; +1.620ms] or [+9.451%; +10.593%]

brettlangdon

initial feedback, I haven't had a chance to go through the pytorch integration + tests yet

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7304981840

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

…time span Adds a pytorch integration that instruments distributed PyTorch training jobs by wrapping torch.distributed.init_process_group / destroy_process_group to open and close a pytorch.rank lifetime span per worker process. Key features: - pytorch.rank span carries rank, world_size, training_job.id (from RAY_JOB_ID, TORCHELASTIC_RUN_ID, KUBEFLOW_TRAINING_JOB_ID, SLURM_JOB_ID, or DD_PYTORCH_JOB_ID), device info (GPU UUID via pynvml / torch props), and framework tag (ddp / fsdp / deepspeed) - Cross-rank correlation via shared training_job.id tag - Ray Train integration: propagates _DD_RAY_RUN_METADATA JSON env var into ray.submission_id and ray.metadata.job_name span tags - Fork-safe: at-fork hooks reset bootstrap state; _installed preserved to match inherited wrapt wrappers so uninstall() works correctly in child processes - UUID CUDA_VISIBLE_DEVICES support: uses nvmlDeviceGetHandleByUUID when CUDA_VISIBLE_DEVICES contains GPU-/MIG- UUIDs (k8s NVIDIA device plugin) - skip_venv_artifacts + skip_pip_cache in suitespec to avoid torch venv upload timeouts in CI - _DD_RAY_RUN_METADATA type corrected to "json" in supported-configurations.json

…acer Adds dd_training_step_begin/end ctypes bindings in _c_tracer.py and wraps torch.optim.Optimizer.step (gated on DD_TRAINING_STEP_PROFILING=true) so the C tracer gets precise step boundaries for Layer 2 per-step spans. Also fires step_begin after distributed bootstrap and step_end before the rank-root span closes, bounding the first and last steps of training.

…s.json

…tributeError PropertyMock on type(lib) does not override MagicMock's auto-attribute creation. Using spec= restricts the mock to only declared attributes, so accessing dd_training_step_begin/end raises AttributeError as intended.

The ctypes.CDLL mock approach was unreliable: accessing an attribute on a MagicMock used as return_value did not raise AttributeError as intended. Replace with direct module-state injection (same pattern as the existing swallows_exception tests): set _loaded=True, _lib=object(), and _step_begin/end_fn to None or a Mock. This tests the actual behavior of step_begin/end without depending on _load() internals.

- _c_tracer.py: add service tag to the context snapshot passed to C tracer; reads ddtrace.config.service (i.e. DD_SERVICE) at call time - _rank_root.py: pytorch.rank span now uses ddtrace.config.service when set, falling back to ext_service("pytorch") — inherits Ray service name - test_c_tracer.py: update tag count assertion from 4 to 5 and verify the service key is present in the payload

…D_SERVICE ext_service doesn't check the global service; int_service does, and it already guards against using the inferred-base-service (auto-derived module name), so DD_SERVICE set explicitly by the user (e.g. a Ray job service) propagates correctly while the default 'pytorch' is preserved when DD_SERVICE is unset.

…onfig.service ddtrace.config.service can be empty or the inferred module name; the pytorch.rank span's actual service (set by int_service, which picks up DD_SERVICE correctly) is the authoritative value to pass through.

…ning_job_id

brettlangdon · 2026-06-11T07:42:51Z

+
+def install() -> None:
+    global _installed
+    with _install_lock:


why do we need a lock and global installed marker?

isn't this managed/called by patch() so a single entrypoint that is GIL bound and already handles the "is installed" bit ?

_installed guards the post-import hook callbacks for FSDP/DeepSpeed. Wrapt has no deregister API, so without it a late import torch.distributed.fsdp after unpatch() would re-wrap the constructor. Happy to simplify by reusing torch._datadog_patch directly in the callbacks instead of a separate flag, if you prefer it.

brettlangdon · 2026-06-11T07:48:25Z

 .analysis/
-
-# file created when running scripts/lint
-uv.lock


we don't want to remove this .gitignore, can we add it back?

…straint - Re-add tests/contrib/pytorch/__init__.py (lost during squash) so that PatchTestCase subprocess re-runs can resolve the test module by name - Remove child_of=None from _build_span so pytorch.rank inherits the active context (e.g. nests under ray.train.worker when Ray is active) - Simplify supported version constraint from >=2.0,<3.0 to >=2.0 so the integration registry test (which uses Specifier, not SpecifierSet) can parse it; upper bound is enforced at runtime in patch() anyway

…arenting Two test fixes: - patch.py: restore `TORCH_VERSION >= (3, 0, 0)` guard that was dropped when simplifying _supported_versions() to `">=2.0"` — the registry string only needs the lower bound but the runtime check still enforces <3.0. - _rank_root._build_span: pass child_of=tracer.current_span() when framework is "ray" so pytorch.rank nests under the active ray.train.worker span; activate=False alone does not inherit the active context.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ed0af7767

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T21:23:32Z

+    if not hasattr(torch.optim, "Optimizer"):
+        return
+    try:
+        _wrap("torch.optim", "Optimizer.step", _wrapped_optimizer_step)


Wrap concrete optimizer step implementations

With DD_TRAINING_STEP_PROFILING enabled, this wraps only torch.optim.Optimizer.step, but common PyTorch optimizers such as SGD/Adam/AdamW define their own step methods, so Python method resolution calls the subclass implementation and never reaches this wrapper. In those normal training loops, the C tracer only gets the initial step_begin() from bootstrap and never receives per-step step_end()/step_begin() signals, making the opt-in step profiling ineffective for standard optimizers.

Useful? React with 👍 / 👎.

This is fine in the initial version. Will update in a followup PR

chatgpt-codex-connector · 2026-06-11T21:23:32Z

+    _cached_distributed_backend = None
+    _fsdp_hook_registered = False
+    _deepspeed_hook_registered = False
+    _optimizer_wrapped = False


Preserve optimizer wrapper state after fork

In a forked child with DD_TRAINING_STEP_PROFILING=true, the Optimizer.step wrapt wrapper installed in the parent is inherited, but this reset marks _optimizer_wrapped as False. If the child then calls ddtrace.unpatch(pytorch=True), _uninstall_optimizer_step() returns early on that flag and leaves the inherited optimizer wrapper active after unpatch; fresh evidence in this version is that _installed is no longer cleared after fork, but the optimizer-specific installed flag still is.

Useful? React with 👍 / 👎.

kr-igor requested review from a team as code owners June 3, 2026 18:02

kr-igor requested review from juanjux, rachelyangdog and wantsui June 3, 2026 18:02

chatgpt-codex-connector Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread ddtrace/contrib/internal/pytorch/_distributed.py Outdated

kr-igor changed the title ~~feat(pytorch): add always-on pytorch.rank lifetime span for distributed training (L0)~~ feat(pytorch): instrument distributed training with pytorch.rank lifetime span Jun 3, 2026

kr-igor marked this pull request as draft June 3, 2026 19:40

kr-igor force-pushed the kr-igor/pytorch-rank-span branch 16 times, most recently from cb263bb to 00c7495 Compare June 4, 2026 18:19

brettlangdon reviewed Jun 8, 2026

View reviewed changes

Comment thread scripts/gen_gitlab_config.py

Comment thread .gitignore

Comment thread riotfile.py

Comment thread riotfile.py

Comment thread tests/contrib/suitespec.yml Outdated

Comment thread tests/suitespec.yml Outdated

Comment thread tests/contrib/pytorch/conftest.py

kr-igor force-pushed the kr-igor/pytorch-rank-span branch 2 times, most recently from 4493daf to a8e0346 Compare June 9, 2026 14:39

kr-igor marked this pull request as ready for review June 9, 2026 17:38

chatgpt-codex-connector Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread ddtrace/contrib/internal/pytorch/_distributed.py Outdated

Comment thread ddtrace/contrib/internal/pytorch/_device.py

wantsui reviewed Jun 9, 2026

View reviewed changes

Comment thread ddtrace/_monkey.py

kr-igor force-pushed the kr-igor/pytorch-rank-span branch from 5c02ace to 89cc563 Compare June 10, 2026 14:31

kr-igor added 10 commits June 10, 2026 13:50

chore: register DD_TRAINING_STEP_PROFILING in supported-configuration…

8a682cb

…s.json

style: ruff format _distributed.py

204cb42

fix: remove unused ext_service import from _rank_root.py

2894d1d

feat(pytorch): detach pytorch.rank from Ray trace; correlate via trai…

14be5c1

…ning_job_id

brettlangdon reviewed Jun 11, 2026

View reviewed changes

kr-igor force-pushed the kr-igor/pytorch-rank-span branch from 4e27e64 to f6bde19 Compare June 11, 2026 18:13

refactor(pytorch): refactoring / cleanup

034d2f5

kr-igor force-pushed the kr-igor/pytorch-rank-span branch from b9944a2 to 034d2f5 Compare June 11, 2026 19:10

kr-igor marked this pull request as draft June 11, 2026 19:20

kr-igor force-pushed the kr-igor/pytorch-rank-span branch from 400fd44 to 06c6e1b Compare June 11, 2026 20:37

kr-igor force-pushed the kr-igor/pytorch-rank-span branch from 06c6e1b to 5ed0af7 Compare June 11, 2026 21:00

kr-igor marked this pull request as ready for review June 11, 2026 21:20

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

fix(pytorch): include DD_PYTORCH_JOB_ID in no-job-id warning message

03f2c01

Conversation

kr-igor commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Files

Uh oh!

datadog-prod-us1-6 Bot commented Jun 3, 2026 • edited by datadog-datadog-prod-us1-2 Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

ℹ️ Info

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

cit-pr-commenter-54b7da Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codeowners resolved as

Uh oh!

pr-commenter Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

scenario:iastaspects-index_aspect

scenario:iastaspectsospath-ospathbasename_aspect

scenario:span-start

Uh oh!

brettlangdon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

brettlangdon Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

kr-igor Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

brettlangdon Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

kr-igor Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kr-igor commented Jun 3, 2026 •

edited

Loading

datadog-prod-us1-6 Bot commented Jun 3, 2026 •

edited by datadog-datadog-prod-us1-2 Bot

Loading

cit-pr-commenter-54b7da Bot commented Jun 3, 2026 •

edited

Loading

pr-commenter Bot commented Jun 3, 2026 •

edited

Loading

kr-igor Jun 11, 2026 •

edited

Loading