Fix ROCm HtoD memcpy stream attribution #1398

Closed
sanrise wants to merge 1 commit into pytorch:main from sanrise:export-D103952982

sanrise (Contributor) commented May 11, 2026

Summary:
Problem
AMD traces can render every HtoD memcpy as if it happened on stream 0, even when the nearby GPU work is clearly running on real non-zero streams. That makes the trace misleading for the two questions users most want answered: whether host-to-device copies overlap with active GPU work, and which stream context issued each copy.

Why
The ROCm async memory copy record can arrive without a useful queue id, so Kineto receives the copy as queue 0. At the same time, the HIP runtime stream context around the same work can still identify the real stream that issued the operation. Before this change, Kineto published the unusable queue 0 value directly, so the viewer collapsed those HtoD copies onto stream 0.

Fix
This change repairs only the safe case. The shared ROCm stream/queue helper uses correlated GPU activity to learn an unambiguous mapping from each HIP stream to a non-zero ROCm queue, then uses that mapping to backfill HtoD memcpy rows that would otherwise render as stream 0. Memcpy records that already carry a non-zero queue are preserved, and copies whose stream maps ambiguously stay on stream 0 instead of being assigned a guessed queue.
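
The backfill rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `GpuActivity`, `Kind`, `StreamQueueMap`, and `backfillHtoDQueues` are hypothetical names, the two-pass structure is an assumption, and "ambiguous" is taken here to mean that one HIP stream was observed with more than one non-zero queue.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical record types for illustration; these are not Kineto's structs.
enum class Kind { Kernel, MemcpyHtoD, MemcpyOther };

struct GpuActivity {
  Kind kind;
  std::uint64_t hipStream;  // stream seen on the correlated HIP runtime call
  std::uint64_t queueId;    // queue from the async record; 0 means "not usable"
};

// Learns HIP stream -> non-zero queue. A stream observed with two different
// non-zero queues is marked ambiguous (std::nullopt) and is never used.
class StreamQueueMap {
 public:
  void observe(std::uint64_t hipStream, std::uint64_t queueId) {
    if (queueId == 0) {
      return;  // queue 0 carries no information
    }
    auto [it, inserted] = map_.try_emplace(hipStream, queueId);
    if (!inserted && it->second != queueId) {
      it->second = std::nullopt;  // conflicting evidence: refuse to guess
    }
  }

  // Returns the learned queue, or 0 when the stream is unknown or ambiguous.
  std::uint64_t resolve(std::uint64_t hipStream) const {
    auto it = map_.find(hipStream);
    if (it == map_.end() || !it->second.has_value()) {
      return 0;
    }
    return *it->second;
  }

 private:
  std::unordered_map<std::uint64_t, std::optional<std::uint64_t>> map_;
};

// Assumed two-pass flow: learn from every record first, then repair only
// HtoD copies that arrived with queue 0. Non-zero queues are left untouched.
void backfillHtoDQueues(std::vector<GpuActivity>& activities) {
  StreamQueueMap map;
  for (const auto& a : activities) {
    map.observe(a.hipStream, a.queueId);
  }
  for (auto& a : activities) {
    if (a.kind == Kind::MemcpyHtoD && a.queueId == 0) {
      a.queueId = map.resolve(a.hipStream);
    }
  }
}
```

In this sketch an HtoD copy on a stream whose kernels all ran on queue 3 is rendered on queue 3, while a copy on a stream seen with both queue 1 and queue 2 keeps queue 0. Resolving to 0 for unknown or ambiguous streams means the fallback is the same as the behavior before the change.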

Caveat [from Michael Wootton]
This is a Kineto rendering/backfill fix, not proof that an SDMA copy physically executed on the compute queue shown in the UI. Some memory copies may be serviced by SDMA without a queue id, and some may be serviced by a blit kernel. Mapping a queue-less copy onto the HIP stream's GPU queue is therefore virtual attribution: useful for making the trace readable and preserving stream-context overlap, but not a replacement for ROCm reporting an explicit copy queue or for a future dedicated pseudo-queue representation for queue-less copies.

Reviewed By: scotts

Differential Revision: D103952982

meta-codesync Bot commented May 11, 2026

@sanrise has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103952982.

sanrise (Contributor, Author) commented May 11, 2026

@mwootton please take a look.

mwootton (Contributor) commented

That looks pretty straightforward. Just a little light remapping. If the test passes, it seems fine to me.

sanrise force-pushed the export-D103952982 branch from d587923 to 555140c on May 15, 2026, 21:32

meta-codesync Bot closed this in 799b5f4 on May 15, 2026

meta-codesync Bot commented May 15, 2026

This pull request has been merged in 799b5f4.
