Skip to content

[review-only] PR1: fix(rocm): attribute per-stream GPU activities under rocprofiler-sdk#1

Draft
ajassani wants to merge 1 commit into
mainfrom
pr1/stream-mapping-rebased
Draft

[review-only] PR1: fix(rocm): attribute per-stream GPU activities under rocprofiler-sdk#1
ajassani wants to merge 1 commit into
mainfrom
pr1/stream-mapping-rebased

Conversation

@ajassani
Copy link
Copy Markdown
Owner

Fork-only PR for Adeem's review. Do not merge — this PR exists so we can
discuss inline before sending to pytorch/kineto:main.

Heads-up before review

1. Conflict with recently-merged upstream PR pytorch#1398

#1398 "Fix ROCm HtoD memcpy stream attribution"
landed on 2026-05-15 by Darshan Sanghani (Meta). It introduces a new
libkineto/src/RocmStreamQueue.h helper with a StreamQueueMaps struct that
backfills HtoD memcpy queue ids post-hoc. It does not touch kernel
rows, so our PR1 kernel fix is still functionally needed, but:

Before sending upstream we should either (a) refactor to plug into the new
RocmStreamQueue.h infrastructure, or (b) keep the inline approach and
justify why both mechanisms are needed (kernels need callback-time capture;
backfill won't work because the kernel record can be the only source of the
correct stream-to-queue mapping).

2. Other relevant upstream activity

  • #1397 — Kineto is now C++20.
  • #1388 / #1399LIBKINETO_NO* flags removed; replaced by KINETO_BACKEND. PR3 currently carries a "shim" commit that re-adds the legacy flags for our local PyTorch dev tree; that commit must be dropped before going upstream (separate PyTorch PR needed).

3. Lint pass not yet run

Upstream uses lintrunner (CLANGFORMAT + NEWLINE + SPACES + TABS). I have
not run it locally yet. Will do before upstream submission.


Original PR body (this is what will go upstream)

fix(rocm): attribute per-stream GPU activities under the rocprofiler-sdk backend

Problem

Under the new rocprofiler-sdk backend, the "stream" column of every GPU
activity row is sourced from record->dispatch.queue_id.handle. That field
is the HSA queue id — allocated per HIP-internal queue pool, not per
HIP stream — so the values exposed in the trace are:

  • Wrong frame of reference: the value is not a hipStream_t and does not
    match what user code created via torch.cuda.Stream().
  • Drift from the legacy roctracer backend: the older roctracer path
    emits stream=0 for the default stream and small sequential ids for
    others. The new backend emits arbitrary HSA queue handles (e.g.
    stream=1 for default, stream=4 for RCCL on the very first DDP run).
    Traces from the two backends for the same workload no longer line up.
  • Latent collapse risk: in workloads with enough HIP streams that HIP
    consolidates them onto a single HSA queue pool, multiple HIP streams
    end up on a single "stream" row in the trace.

Why this needs to land now

The PyTorch ROCm wheel currently pins a Kineto SHA that predates
rocprofiler-sdk being the default. As soon as PyTorch bumps its Kineto
submodule pin past the March 2026 default switch, every DDP / multi-stream
ROCm trace exhibits this bug.

Fix

Capture hipStream_t at the API callback phase (where it's available as a
call argument), key it by correlation_id, then look it up in the buffer
callback phase to set the correct queue value on the activity. Entries are
erased after lookup to bound memory.

For human readability, each unique hipStream_t pointer is mapped to a small
sequential integer id (nullptr → 0), matching the convention used by the
roctracer path.

Covers kernel launches, memory copies, memset, and all remaining HIP APIs
that accept a stream parameter.

Tests / evidence

  • Existing RocmActivityProfilerTest suite passes unchanged.

  • 4-rank DDP validation on MI210 (train_ddp_crosscp.py, world_size=4,
    4096×4096 linear layers, bucket_cap_mb=8 → 16 small allreduces per step).
    Captured rank0 traces with both baseline/rocprofsdk-stock (no PR1) and
    this PR applied:

    property baseline-stock with PR1
    distinct GPU rows 2 2
    compute stream id 1 (arbitrary HSA q) 0 (matches default)
    RCCL stream id 4 (arbitrary HSA q) 1
    RCCL kernels on RCCL row 11 of 11 10 of 10
    compute kernels on row 0 n/a (was row 1) 75 / 75
    matches roctracer numbering no yes
  • Single-process workloads with ≤2 HIP streams retain correct attribution
    on both backends — the values just differ. PR1 unifies them.

Backend parity

After this PR, rocprofiler-sdk and roctracer produce functionally equivalent
stream attribution
for the same workload. Same kernels on the same rows,
same arrows.

Out of scope

  • Inter-stream dependency tracking (sync-API target metadata). That is PR2 in
    this series.
  • roctracer-side changes. roctracer was already correct.

…rows

rocprofiler-sdk buffer records expose only HSA queue_id, which is shared
across all HIP streams on a device.  This causes all GPU activities to
appear on a single "stream" in the Chrome trace, making multi-stream
workloads (DDP, communication-compute overlap) impossible to analyze.

Build a correlation_id → hipStream_t map during the API callback phase
(where the actual hipStream_t pointer is available from HIP API args),
then look it up in the buffer callback to set the correct per-stream
queue value.  Entries are erased after consumption to bound memory usage.

Assign small sequential integer IDs to each unique hipStream_t pointer
(nullptr → 0) for human-readable traces, matching the convention used
by the older roctracer backend.

Covers kernel launches, memory copies, memset, and all remaining HIP
APIs that accept a stream parameter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant