[review-only] PR1: fix(rocm): attribute per-stream GPU activities under rocprofiler-sdk#1
Draft
ajassani wants to merge 1 commit into
…rows

rocprofiler-sdk buffer records expose only HSA queue_id, which is shared across all HIP streams on a device. This causes all GPU activities to appear on a single "stream" in the Chrome trace, making multi-stream workloads (DDP, communication-compute overlap) impossible to analyze.

Build a correlation_id → hipStream_t map during the API callback phase (where the actual hipStream_t pointer is available from HIP API args), then look it up in the buffer callback to set the correct per-stream queue value. Entries are erased after consumption to bound memory usage.

Assign small sequential integer IDs to each unique hipStream_t pointer (nullptr → 0) for human-readable traces, matching the convention used by the older roctracer backend.

Covers kernel launches, memory copies, memset, and all remaining HIP APIs that accept a stream parameter.
Heads-up before review
1. Conflict with recently-merged upstream PR pytorch#1398
#1398 "Fix ROCm HtoD memcpy stream attribution" landed on 2026-05-15 by Darshan Sanghani (Meta). It introduces a new libkineto/src/RocmStreamQueue.h helper with a StreamQueueMaps struct that backfills HtoD memcpy queue ids post-hoc. It does not touch kernel rows, so our PR1 kernel fix is still functionally needed, but:
- correlation_id → hipStream_t maps in RocprofLogger.cpp instead of using the new helper
- post-process pass over collected rows
Before sending upstream we should either (a) refactor to plug into the new RocmStreamQueue.h infrastructure, or (b) keep the inline approach and justify why both mechanisms are needed (kernels need callback-time capture; backfill won't work because the kernel record can be the only source of the correct stream-to-queue mapping).
2. Other relevant upstream activity
- LIBKINETO_NO* flags removed; replaced by KINETO_BACKEND. PR3 currently carries a "shim" commit that re-adds the legacy flags for our local PyTorch dev tree; that commit must be dropped before going upstream (separate PyTorch PR needed).
3. Lint pass not yet run
Upstream uses lintrunner (CLANGFORMAT + NEWLINE + SPACES + TABS). I have not run it locally yet. Will do before upstream submission.
Original PR body (this is what will go upstream)
fix(rocm): attribute per-stream GPU activities under the rocprofiler-sdk backend
Problem
Under the new rocprofiler-sdk backend, the "stream" column of every GPU activity row is sourced from record->dispatch.queue_id.handle. That field is the HSA queue id, which is allocated per HIP-internal queue pool rather than per HIP stream, so the values exposed in the trace are:
- hipStream_t and does not match what user code created via torch.cuda.Stream().
- emits stream=0 for the default stream and small sequential ids for others. The new backend emits arbitrary HSA queue handles (e.g. stream=1 for default, stream=4 for RCCL on the very first DDP run). Traces from the two backends for the same workload no longer line up.
- consolidates them onto a single HSA queue pool, multiple HIP streams end up on a single "stream" row in the trace.
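The row-collapsing effect described above can be illustrated with a small sketch. The Activity struct and both helper functions are hypothetical stand-ins invented for illustration; they are not the rocprofiler-sdk record layout or any Kineto API. The sketch only shows that grouping by a shared queue id yields fewer rows than grouping by the stream the user launched on.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, simplified activity record (not the rocprofiler-sdk struct).
struct Activity {
  const void* stream;  // stand-in for the hipStream_t the user launched on
  uint64_t queueId;    // stand-in for the HSA queue handle in the record
};

// Number of trace rows when activities are grouped by HSA queue id.
std::size_t rowsByQueue(const std::vector<Activity>& acts) {
  std::set<uint64_t> rows;
  for (const auto& a : acts) {
    rows.insert(a.queueId);
  }
  return rows.size();
}

// Number of trace rows when activities are grouped by the launching stream.
std::size_t rowsByStream(const std::vector<Activity>& acts) {
  std::set<const void*> rows;
  for (const auto& a : acts) {
    rows.insert(a.stream);
  }
  return rows.size();
}
```

With two distinct streams that share one queue id, rowsByQueue reports a single row while rowsByStream reports two, which is the loss of information the trace suffers from.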
Why this needs to land now
The PyTorch ROCm wheel currently pins a Kineto SHA that predates rocprofiler-sdk being the default. As soon as PyTorch bumps its Kineto submodule pin past the March 2026 default switch, every DDP / multi-stream ROCm trace will exhibit this bug.
Fix
Capture hipStream_t at the API callback phase (where it's available as a call argument), key it by correlation_id, then look it up in the buffer callback phase to set the correct queue value on the activity. Entries are erased after lookup to bound memory.
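A minimal sketch of the capture-then-consume pattern described above, not the code in this PR. The class name CorrelationStreamMap, its methods, and the StreamHandle alias (a stand-in for hipStream_t so the sketch builds without HIP headers) are all assumptions made for illustration. The mutex is also an assumption, on the premise that the API callback and the buffer callback may run on different threads.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <optional>
#include <unordered_map>

// Stand-in for hipStream_t so the sketch builds without the HIP headers.
using StreamHandle = const void*;

// Hypothetical helper: records the stream seen in the API callback and
// hands it back exactly once to the buffer callback.
class CorrelationStreamMap {
 public:
  // API callback phase: the stream is available as a call argument.
  void record(uint64_t correlationId, StreamHandle stream) {
    std::lock_guard<std::mutex> lock(mutex_);
    streams_[correlationId] = stream;
  }

  // Buffer callback phase: look up and erase so the map stays bounded.
  std::optional<StreamHandle> consume(uint64_t correlationId) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = streams_.find(correlationId);
    if (it == streams_.end()) {
      return std::nullopt;  // nothing captured for this correlation id
    }
    StreamHandle stream = it->second;
    streams_.erase(it);
    return stream;
  }

  // Number of entries still waiting to be consumed.
  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return streams_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<uint64_t, StreamHandle> streams_;
};
```

Returning std::optional lets the caller decide what to do when no stream was captured, for example keeping the HSA queue id that the record already carries.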
For human readability, each unique hipStream_t pointer is mapped to a small sequential integer id (nullptr → 0), matching the convention used by the roctracer path.
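The id assignment could look roughly like the sketch below. StreamIdAllocator and idFor are hypothetical names, and StreamHandle again stands in for hipStream_t; the only behavior taken from the description is that nullptr maps to 0 and every other distinct pointer receives the next small integer in first-seen order.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Stand-in for hipStream_t so the sketch builds without the HIP headers.
using StreamHandle = const void*;

// Hypothetical helper: maps each distinct stream pointer to a small id.
class StreamIdAllocator {
 public:
  uint64_t idFor(StreamHandle stream) {
    if (stream == nullptr) {
      return 0;  // the default (null) stream is always id 0
    }
    std::lock_guard<std::mutex> lock(mutex_);
    // try_emplace inserts only if the pointer has not been seen before.
    auto [it, inserted] = ids_.try_emplace(stream, nextId_);
    if (inserted) {
      ++nextId_;
    }
    return it->second;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<StreamHandle, uint64_t> ids_;
  uint64_t nextId_ = 1;  // 0 is reserved for the default stream
};
```

Ids assigned this way depend on the order in which streams are first observed, so they are stable within one trace but not guaranteed to match across runs.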
Covers kernel launches, memory copies, memset, and all remaining HIP APIs
that accept a stream parameter.
Tests / evidence
Existing RocmActivityProfilerTest suite passes unchanged.
4-rank DDP validation on MI210 (train_ddp_crosscp.py, world_size=4, 4096×4096 linear layers, bucket_cap_mb=8 → 16 small allreduces per step). Captured rank0 traces with both baseline/rocprofsdk-stock (no PR1) and this PR applied:

| Stream | baseline/rocprofsdk-stock | This PR |
| --- | --- | --- |
| default | 1 (arbitrary HSA q) | 0 (matches default) |
| RCCL | 4 (arbitrary HSA q) | 1 |

Single-process workloads with ≤2 HIP streams retain correct attribution on both backends; the values just differ. PR1 unifies them.
Backend parity
After this PR, rocprofiler-sdk and roctracer produce functionally equivalent
stream attribution for the same workload. Same kernels on the same rows,
same arrows.
Out of scope
this series.