[review-only] PR1: fix(rocm): attribute per-stream GPU activities under rocprofiler-sdk by ajassani · Pull Request #1 · ajassani/kineto

ajassani · 2026-05-23T13:21:05Z

Fork-only PR for Adeem's review. Do not merge — this PR exists so we can
discuss inline before sending to pytorch/kineto:main.

Heads-up before review

1. Conflict with recently-merged upstream PR pytorch#1398

#1398 "Fix ROCm HtoD memcpy stream attribution"
landed on 2026-05-15 by Darshan Sanghani (Meta). It introduces a new
libkineto/src/RocmStreamQueue.h helper with a StreamQueueMaps struct that
backfills HtoD memcpy queue ids post-hoc. It does not touch kernel
rows, so our PR1 kernel fix is still functionally needed, but:

our PR1 inlines its own correlation_id → hipStream_t maps in
RocprofLogger.cpp instead of using the new helper
our PR1 fixes attribution at api-callback time, whereas Fix ROCm HtoD memcpy stream attribution pytorch/kineto#1398 does it as a
post-process pass over collected rows

Before sending upstream we should either (a) refactor to plug into the new
RocmStreamQueue.h infrastructure, or (b) keep the inline approach and
justify why both mechanisms are needed (kernels need callback-time capture;
backfill won't work because the kernel record can be the only source of the
correct stream-to-queue mapping).

2. Other relevant upstream activity

#1397 — Kineto is now C++20.
#1388 / #1399 — LIBKINETO_NO* flags removed; replaced by KINETO_BACKEND. PR3 currently carries a "shim" commit that re-adds the legacy flags for our local PyTorch dev tree; that commit must be dropped before going upstream (separate PyTorch PR needed).

3. Lint pass not yet run

Upstream uses lintrunner (CLANGFORMAT + NEWLINE + SPACES + TABS). I have
not run it locally yet. Will do before upstream submission.

Original PR body (this is what will go upstream)

fix(rocm): attribute per-stream GPU activities under the rocprofiler-sdk backend

Problem

Under the new rocprofiler-sdk backend, the "stream" column of every GPU
activity row is sourced from record->dispatch.queue_id.handle. That field
is the HSA queue id — allocated per HIP-internal queue pool, not per
HIP stream — so the values exposed in the trace are:

Wrong frame of reference: the value is not a hipStream_t and does not
match what user code created via torch.cuda.Stream().
Drift from the legacy roctracer backend: the older roctracer path
emits stream=0 for the default stream and small sequential ids for
others. The new backend emits arbitrary HSA queue handles (e.g.
stream=1 for default, stream=4 for RCCL on the very first DDP run).
Traces from the two backends for the same workload no longer line up.
Latent collapse risk: in workloads with enough HIP streams that HIP
consolidates them onto a single HSA queue pool, multiple HIP streams
end up on a single "stream" row in the trace.

Why this needs to land now

The PyTorch ROCm wheel currently pins a Kineto SHA that predates
rocprofiler-sdk being the default. As soon as PyTorch bumps its Kineto
submodule pin past the March 2026 default switch, every DDP / multi-stream
ROCm trace exhibits this bug.

Fix

Capture hipStream_t at the API callback phase (where it's available as a
call argument), key it by correlation_id, then look it up in the buffer
callback phase to set the correct queue value on the activity. Entries are
erased after lookup to bound memory.

For human readability, each unique hipStream_t pointer is mapped to a small
sequential integer id (nullptr → 0), matching the convention used by the
roctracer path.

Covers kernel launches, memory copies, memset, and all remaining HIP APIs
that accept a stream parameter.

Tests / evidence

Existing RocmActivityProfilerTest suite passes unchanged.

4-rank DDP validation on MI210 (train_ddp_crosscp.py, world_size=4,
4096×4096 linear layers, bucket_cap_mb=8 → 16 small allreduces per step).
Captured rank0 traces with both baseline/rocprofsdk-stock (no PR1) and
this PR applied:

property	baseline-stock	with PR1
distinct GPU rows	2	2
compute stream id	`1` (arbitrary HSA q)	`0` (matches default)
RCCL stream id	`4` (arbitrary HSA q)	`1`
RCCL kernels on RCCL row	11 of 11	10 of 10
compute kernels on row 0	n/a (was row 1)	75 / 75
matches roctracer numbering	no	yes

Single-process workloads with ≤2 HIP streams retain correct attribution
on both backends — the values just differ. PR1 unifies them.

Backend parity

After this PR, rocprofiler-sdk and roctracer produce functionally equivalent
stream attribution for the same workload. Same kernels on the same rows,
same arrows.

Out of scope

Inter-stream dependency tracking (sync-API target metadata). That is PR2 in
this series.
roctracer-side changes. roctracer was already correct.

…rows rocprofiler-sdk buffer records expose only HSA queue_id, which is shared across all HIP streams on a device. This causes all GPU activities to appear on a single "stream" in the Chrome trace, making multi-stream workloads (DDP, communication-compute overlap) impossible to analyze. Build a correlation_id → hipStream_t map during the API callback phase (where the actual hipStream_t pointer is available from HIP API args), then look it up in the buffer callback to set the correct per-stream queue value. Entries are erased after consumption to bound memory usage. Assign small sequential integer IDs to each unique hipStream_t pointer (nullptr → 0) for human-readable traces, matching the convention used by the older roctracer backend. Covers kernel launches, memory copies, memset, and all remaining HIP APIs that accept a stream parameter.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[review-only] PR1: fix(rocm): attribute per-stream GPU activities under rocprofiler-sdk#1

[review-only] PR1: fix(rocm): attribute per-stream GPU activities under rocprofiler-sdk#1
ajassani wants to merge 1 commit into
mainfrom
pr1/stream-mapping-rebased

ajassani commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ajassani commented May 23, 2026

Heads-up before review

1. Conflict with recently-merged upstream PR pytorch#1398

2. Other relevant upstream activity

3. Lint pass not yet run

Original PR body (this is what will go upstream)

fix(rocm): attribute per-stream GPU activities under the rocprofiler-sdk backend

Problem

Why this needs to land now

Fix

Tests / evidence

Backend parity

Out of scope

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant