Skip to content

feat(pytorch): instrument distributed training with pytorch.rank lifetime span#18449

Open
kr-igor wants to merge 15 commits into
mainfrom
kr-igor/pytorch-rank-span
Open

feat(pytorch): instrument distributed training with pytorch.rank lifetime span#18449
kr-igor wants to merge 15 commits into
mainfrom
kr-igor/pytorch-rank-span

Conversation

@kr-igor

@kr-igor kr-igor commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Patches torch.distributed.init_process_group / destroy_process_group to emit a single pytorch.rank span per rank for the lifetime of the process group.
  • Tags: rank, world_size, framework (DDP / FSDP / DeepSpeed), launcher, torch.distributed.backend, training_job_id (auto-resolved from RAY_JOB_ID / TORCHELASTIC_RUN_ID / KUBEFLOW_TRAINING_JOB_ID / SLURM_JOB_ID).
  • When running under Ray Train: ray.train.run_name, ray.submission_id, ray.metadata.*.
  • Span is rotated every 10 minutes: current span is finished and flushed, a fresh continuation opens. Rotated spans carry _dd.was_long_running=1.
  • When the Datadog C tracer (dd-trace-c) is present via LD_PRELOAD, all GPU-level root spans it creates automatically become children of pytorch.rank via dd_set_global_parent_context / dd_clear_global_parent_context.
  • No new configuration required. Enable with DD_PATCH_MODULES=pytorch:true. DD_TRACE_PYTORCH_ENABLED and DD_PYTORCH_JOB_ID are registered in the configuration registry.

Validation

  • 99 unit tests pass locally (scripts/run-tests --venv 116b0b8)
  • Configuration Registry check passes (python scripts/supported_configurations.py --check)
  • pr_name_lint passes
  • ast-grep rules pass (uses env.get() not os.environ.get())
  • CI runs full torch 2.0–2.3 matrix (Python 3.9–3.12) — in progress
  • Verify pytorch.rank span appears in Datadog UI when running a distributed training job with DD_PATCH_MODULES=pytorch:true

Files

Path What
ddtrace/contrib/internal/pytorch/ New integration (7 source files + patch entry)
ddtrace/_monkey.py pytorch: False (opt-in only)
ddtrace/internal/settings/_config.py pytorch in INTEGRATION_CONFIGS
supported-configurations.json + _supported_configurations.py DD_TRACE_PYTORCH_ENABLED, DD_PYTORCH_JOB_ID, DD_PYTORCH_SERVICE
riotfile.py + tests/contrib/suitespec.yml pytorch venv (torch 2.0–2.3, Python 3.9–3.12)
.riot/requirements/ Lockfiles for all pytorch venv combinations
tests/contrib/pytorch/ 99 tests
releasenotes/notes/pytorch-rank-span-*.yaml Release note

@kr-igor kr-igor requested review from a team as code owners June 3, 2026 18:02
@datadog-prod-us1-6

datadog-prod-us1-6 Bot commented Jun 3, 2026

Copy link
Copy Markdown

Pipelines  Tests

Fix all issues with BitsAI or with Cursor

⚠️ Warnings

🚦 8 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741238-d2b8243-manylinux2014_x86_64, 1]   View in Datadog   GitLab

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741491-d2b8243-musllinux_1_2_x86_64, 1]   View in Datadog   GitLab

DataDog/apm-reliability/dd-trace-py | build linux serverless: [arm64, cp315-cp315, v113741357-d2b8243-manylinux2014_aarch64, 1]   View in Datadog   GitLab

View all 8 failed jobs.

❄️ 1 New flaky test detected

test_rotation_tags_old_span_was_long_running[py3.11] from test_long_running_span.py   View in Datadog (Fix with Cursor)
assert None == 1
 +  where None = <built-in method _get_numeric_attribute of Span object at 0x7b63d85060c0>('_dd.was_long_running')
 +    where <built-in method _get_numeric_attribute of Span object at 0x7b63d85060c0> = Span(name='pytorch.rank', span_id=17071592593779704320, parent_id=None, trace_id=1411216454804671177739239564532773709..._priority_v1': 2}, _span_links=[], _baggage={}, _is_remote=False), service_entry_span_name=pytorch.rank), metastruct={}.get_metric

New test introduced in this PR is flaky.

View in Flaky Test Management

ℹ️ Info

No other issues found (see more)

🧪 All tests passed

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 03f2c01 | Docs | Datadog PR Page | Give us feedback!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3e8a50a2f2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread ddtrace/contrib/internal/pytorch/_distributed.py Outdated
@cit-pr-commenter-54b7da

cit-pr-commenter-54b7da Bot commented Jun 3, 2026

Copy link
Copy Markdown

Codeowners resolved as

ddtrace/contrib/internal/pytorch/_distributed.py                        @DataDog/apm-core-python @DataDog/apm-idm-python

@kr-igor kr-igor changed the title feat(pytorch): add always-on pytorch.rank lifetime span for distributed training (L0) feat(pytorch): instrument distributed training with pytorch.rank lifetime span Jun 3, 2026
@pr-commenter

pr-commenter Bot commented Jun 3, 2026

Copy link
Copy Markdown

Benchmarks

Benchmark execution time: 2026-06-11 22:17:23

Comparing candidate commit 03f2c01 in PR branch kr-igor/pytorch-rank-span with baseline commit 4ce48d8 in branch main.

Found 0 performance improvements and 3 performance regressions! Performance is the same for 614 metrics, 10 unstable metrics.

scenario:iastaspects-index_aspect

  • 🟥 execution_time [+15.793µs; +18.000µs] or [+12.760%; +14.543%]

scenario:iastaspectsospath-ospathbasename_aspect

  • 🟥 execution_time [+94.211µs; +101.960µs] or [+21.345%; +23.100%]

scenario:span-start

  • 🟥 execution_time [+1.446ms; +1.620ms] or [+9.451%; +10.593%]

@kr-igor kr-igor marked this pull request as draft June 3, 2026 19:40
@kr-igor kr-igor force-pushed the kr-igor/pytorch-rank-span branch 16 times, most recently from cb263bb to 00c7495 Compare June 4, 2026 18:19

@brettlangdon brettlangdon left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial feedback, I haven't had a chance to go through the pytorch integration + tests yet

Comment thread scripts/gen_gitlab_config.py
Comment thread .gitignore
Comment thread riotfile.py
Comment thread riotfile.py
Comment thread tests/contrib/suitespec.yml Outdated
Comment thread tests/suitespec.yml Outdated
Comment thread tests/contrib/pytorch/conftest.py
@kr-igor kr-igor force-pushed the kr-igor/pytorch-rank-span branch 2 times, most recently from 4493daf to a8e0346 Compare June 9, 2026 14:39
@kr-igor kr-igor marked this pull request as ready for review June 9, 2026 17:38

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7304981840

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread ddtrace/contrib/internal/pytorch/_distributed.py Outdated
Comment thread ddtrace/contrib/internal/pytorch/_device.py
Comment thread ddtrace/_monkey.py
…time span

Adds a pytorch integration that instruments distributed PyTorch training jobs
by wrapping torch.distributed.init_process_group / destroy_process_group to
open and close a pytorch.rank lifetime span per worker process.

Key features:
- pytorch.rank span carries rank, world_size, training_job.id (from RAY_JOB_ID,
  TORCHELASTIC_RUN_ID, KUBEFLOW_TRAINING_JOB_ID, SLURM_JOB_ID, or
  DD_PYTORCH_JOB_ID), device info (GPU UUID via pynvml / torch props), and
  framework tag (ddp / fsdp / deepspeed)
- Cross-rank correlation via shared training_job.id tag
- Ray Train integration: propagates _DD_RAY_RUN_METADATA JSON env var into
  ray.submission_id and ray.metadata.job_name span tags
- Fork-safe: at-fork hooks reset bootstrap state; _installed preserved to match
  inherited wrapt wrappers so uninstall() works correctly in child processes
- UUID CUDA_VISIBLE_DEVICES support: uses nvmlDeviceGetHandleByUUID when
  CUDA_VISIBLE_DEVICES contains GPU-/MIG- UUIDs (k8s NVIDIA device plugin)
- skip_venv_artifacts + skip_pip_cache in suitespec to avoid torch venv upload
  timeouts in CI
- _DD_RAY_RUN_METADATA type corrected to "json" in supported-configurations.json
@kr-igor kr-igor force-pushed the kr-igor/pytorch-rank-span branch from 5c02ace to 89cc563 Compare June 10, 2026 14:31
kr-igor added 10 commits June 10, 2026 13:50
…acer

Adds dd_training_step_begin/end ctypes bindings in _c_tracer.py and wraps
torch.optim.Optimizer.step (gated on DD_TRAINING_STEP_PROFILING=true) so
the C tracer gets precise step boundaries for Layer 2 per-step spans.

Also fires step_begin after distributed bootstrap and step_end before the
rank-root span closes, bounding the first and last steps of training.
…tributeError

PropertyMock on type(lib) does not override MagicMock's auto-attribute
creation. Using spec= restricts the mock to only declared attributes,
so accessing dd_training_step_begin/end raises AttributeError as intended.
The ctypes.CDLL mock approach was unreliable: accessing an attribute on a
MagicMock used as return_value did not raise AttributeError as intended.

Replace with direct module-state injection (same pattern as the existing
swallows_exception tests): set _loaded=True, _lib=object(), and
_step_begin/end_fn to None or a Mock. This tests the actual behavior of
step_begin/end without depending on _load() internals.
- _c_tracer.py: add service tag to the context snapshot passed to C
  tracer; reads ddtrace.config.service (i.e. DD_SERVICE) at call time
- _rank_root.py: pytorch.rank span now uses ddtrace.config.service when
  set, falling back to ext_service("pytorch") — inherits Ray service name
- test_c_tracer.py: update tag count assertion from 4 to 5 and verify
  the service key is present in the payload
…D_SERVICE

ext_service doesn't check the global service; int_service does, and it
already guards against using the inferred-base-service (auto-derived
module name), so DD_SERVICE set explicitly by the user (e.g. a Ray job
service) propagates correctly while the default 'pytorch' is preserved
when DD_SERVICE is unset.
…onfig.service

ddtrace.config.service can be empty or the inferred module name; the
pytorch.rank span's actual service (set by int_service, which picks up
DD_SERVICE correctly) is the authoritative value to pass through.

def install() -> None:
global _installed
with _install_lock:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a lock and global installed marker?

isn't this managed/called by patch() so a single entrypoint that is GIL bound and already handles the "is installed" bit ?

@kr-igor kr-igor Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_installed guards the post-import hook callbacks for FSDP/DeepSpeed. Wrapt has no deregister API, so without it a late import torch.distributed.fsdp after unpatch() would re-wrap the constructor. Happy to simplify by reusing torch._datadog_patch directly in the callbacks instead of a separate flag, if you prefer it.

Comment thread .gitignore Outdated
.analysis/

# file created when running scripts/lint
uv.lock No newline at end of file

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to remove this .gitignore, can we add it back?

@kr-igor kr-igor force-pushed the kr-igor/pytorch-rank-span branch from 4e27e64 to f6bde19 Compare June 11, 2026 18:13
@kr-igor kr-igor force-pushed the kr-igor/pytorch-rank-span branch from b9944a2 to 034d2f5 Compare June 11, 2026 19:10
@kr-igor kr-igor marked this pull request as draft June 11, 2026 19:20
…straint

- Re-add tests/contrib/pytorch/__init__.py (lost during squash) so that
  PatchTestCase subprocess re-runs can resolve the test module by name
- Remove child_of=None from _build_span so pytorch.rank inherits the
  active context (e.g. nests under ray.train.worker when Ray is active)
- Simplify supported version constraint from >=2.0,<3.0 to >=2.0 so the
  integration registry test (which uses Specifier, not SpecifierSet) can
  parse it; upper bound is enforced at runtime in patch() anyway
@kr-igor kr-igor force-pushed the kr-igor/pytorch-rank-span branch from 400fd44 to 06c6e1b Compare June 11, 2026 20:37
…arenting

Two test fixes:
- patch.py: restore `TORCH_VERSION >= (3, 0, 0)` guard that was dropped when
  simplifying _supported_versions() to `">=2.0"` — the registry string only
  needs the lower bound but the runtime check still enforces <3.0.
- _rank_root._build_span: pass child_of=tracer.current_span() when framework
  is "ray" so pytorch.rank nests under the active ray.train.worker span;
  activate=False alone does not inherit the active context.
@kr-igor kr-igor force-pushed the kr-igor/pytorch-rank-span branch from 06c6e1b to 5ed0af7 Compare June 11, 2026 21:00
@kr-igor kr-igor marked this pull request as ready for review June 11, 2026 21:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ed0af7767

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

if not hasattr(torch.optim, "Optimizer"):
return
try:
_wrap("torch.optim", "Optimizer.step", _wrapped_optimizer_step)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wrap concrete optimizer step implementations

With DD_TRAINING_STEP_PROFILING enabled, this wraps only torch.optim.Optimizer.step, but common PyTorch optimizers such as SGD/Adam/AdamW define their own step methods, so Python method resolution calls the subclass implementation and never reaches this wrapper. In those normal training loops, the C tracer only gets the initial step_begin() from bootstrap and never receives per-step step_end()/step_begin() signals, making the opt-in step profiling ineffective for standard optimizers.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine in the initial version. Will update in a followup PR

_cached_distributed_backend = None
_fsdp_hook_registered = False
_deepspeed_hook_registered = False
_optimizer_wrapped = False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve optimizer wrapper state after fork

In a forked child with DD_TRAINING_STEP_PROFILING=true, the Optimizer.step wrapt wrapper installed in the parent is inherited, but this reset marks _optimizer_wrapped as False. If the child then calls ddtrace.unpatch(pytorch=True), _uninstall_optimizer_step() returns early on that flag and leaves the inherited optimizer wrapper active after unpatch; fresh evidence in this version is that _installed is no longer cleared after fork, but the optimizer-specific installed flag still is.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants