Fix ROCm HtoD memcpy stream attribution #1398
Closed
sanrise wants to merge 1 commit into
Conversation
@sanrise has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103952982.
Contributor, Author:
@mwootton please take a look.
Contributor:
That looks pretty straightforward. Just a little light remapping. If the test passes, it seems fine to me.
Force-pushed from d587923 to 555140c.
This pull request has been merged in 799b5f4.
Summary:
Problem
AMD traces can render HtoD memcpy activity as if every copy happened on stream 0, even while the nearby GPU work is clearly using real non-zero streams. That makes the trace misleading for the main thing users are trying to understand: whether host-to-device copies overlap with active GPU work, and which stream context issued the copy.
Why
The ROCm async memory copy record can arrive without a useful queue id, so Kineto receives the copy as queue 0. At the same time, the HIP runtime stream context around the same work can still identify the real stream that issued the operation. Before this change, Kineto published the unusable queue 0 value directly, so the viewer collapsed those HtoD copies onto stream 0.
Fix
This change repairs only the safe case. The shared ROCm stream/queue helper learns, from correlated GPU activity, which non-zero ROCm queue each HIP stream maps to, and keeps that mapping only when it is unambiguous. It then uses the mapping to backfill HtoD memcpy rows that would otherwise render as stream 0. Existing non-zero memcpy queues are preserved, and copies whose stream has an ambiguous mapping stay on stream 0 instead of being guessed.
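The backfill rule described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `GpuActivity`, `StreamQueueMap`, and `backfillHtoD` are hypothetical names, the two-pass structure is an assumption, and "ambiguous" is assumed here to mean that one HIP stream was observed with more than one non-zero queue.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for a correlated GPU activity row.
struct GpuActivity {
  std::uint64_t hipStream;  // stream seen on the HIP runtime side
  std::uint64_t queueId;    // queue reported by ROCm; 0 means no useful queue
  bool isHtoDMemcpy;        // true for host-to-device copies
};

// Hypothetical helper: remembers one non-zero queue per HIP stream and
// marks the stream ambiguous if a second, different queue shows up.
class StreamQueueMap {
 public:
  void observe(std::uint64_t hipStream, std::uint64_t queueId) {
    if (queueId == 0) {
      return;  // queue 0 carries no information, so never learn from it
    }
    auto [it, inserted] = map_.try_emplace(hipStream, queueId);
    if (!inserted && it->second != queueId) {
      it->second = std::nullopt;  // conflicting queues: refuse to guess
    }
  }

  // Returns the learned queue, or nullopt if unknown or ambiguous.
  std::optional<std::uint64_t> lookup(std::uint64_t hipStream) const {
    auto it = map_.find(hipStream);
    if (it == map_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

 private:
  std::unordered_map<std::uint64_t, std::optional<std::uint64_t>> map_;
};

// Pass 1 learns the mapping from non-memcpy rows (an assumption of this
// sketch); pass 2 rewrites only HtoD rows that still carry queue 0.
void backfillHtoD(std::vector<GpuActivity>& rows) {
  StreamQueueMap map;
  for (const auto& row : rows) {
    if (!row.isHtoDMemcpy) {
      map.observe(row.hipStream, row.queueId);
    }
  }
  for (auto& row : rows) {
    if (!row.isHtoDMemcpy || row.queueId != 0) {
      continue;  // existing non-zero memcpy queues are preserved
    }
    if (auto queue = map.lookup(row.hipStream)) {
      row.queueId = *queue;
    }
  }
}
```

In this sketch a copy is moved off queue 0 only when exactly one non-zero queue was ever seen for its stream; unknown and conflicting streams are left untouched, which mirrors the "stay on stream 0 instead of guessing" behavior.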
Caveat [from Michael Wootton]
This is a Kineto rendering/backfill fix, not proof that an SDMA copy physically executed on the compute queue shown in the UI. Some memory copies may be serviced by SDMA without a queue id, and some may be serviced by a blit kernel. Mapping a queue-less copy onto the HIP stream's GPU queue is therefore virtual attribution: useful for making the trace readable and preserving stream-context overlap, but not a replacement for ROCm reporting an explicit copy queue or for a future dedicated pseudo-queue representation for queue-less copies.
Reviewed By: scotts
Differential Revision: D103952982