Fix ROCm HtoD memcpy stream attribution #1398

Closed
sanrise wants to merge 1 commit into pytorch:main from sanrise:export-D103952982

sanrise (Contributor) commented May 11, 2026

Summary:
Problem
AMD traces can render HtoD memcpy activity as if every copy happened on stream 0, even while the nearby GPU work is clearly using real non-zero streams. That makes the trace misleading for the main thing users are trying to understand: whether host-to-device copies overlap with active GPU work, and which stream context issued the copy.

Why
The ROCm async memory copy record can arrive without a useful queue id, so Kineto receives the copy as queue 0. At the same time, the HIP runtime stream context around the same work can still identify the real stream that issued the operation. Before this change, Kineto published the unusable queue 0 value directly, so the viewer collapsed those HtoD copies onto stream 0.

Fix
This change repairs only the safe case. The shared ROCm stream/queue helper learns an unambiguous HIP stream to non-zero ROCm queue mapping from correlated GPU activity, then uses that mapping to backfill HtoD memcpy rows that would otherwise render as stream 0. Existing non-zero memcpy queues are preserved, and ambiguous mappings stay on stream 0 instead of guessing.
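
The backfill rule described above can be illustrated with a minimal sketch. This is not the code from the change itself: `GpuRow`, `StreamQueueMap`, and `backfillHtoD` are hypothetical names invented for illustration, and the sketch assumes that "ambiguous" means one HIP stream was observed with more than one non-zero queue and that the mapping is learned only from non-memcpy rows.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for one correlated GPU activity row.
struct GpuRow {
  std::uint64_t hipStream;  // HIP stream seen in the runtime stream context
  std::uint32_t queue;      // ROCm queue id on the activity record (0 = unusable)
  bool isHtoDMemcpy;        // true for host-to-device copy rows
};

// Hypothetical helper: learns HIP stream -> non-zero queue, tracking ambiguity.
class StreamQueueMap {
 public:
  // Record one observation taken from correlated GPU activity.
  void learn(std::uint64_t hipStream, std::uint32_t queue) {
    if (queue == 0) {
      return;  // queue 0 carries no usable information
    }
    auto res = map_.emplace(hipStream, Entry{queue, false});
    if (!res.second && res.first->second.queue != queue) {
      // Same stream seen with a different non-zero queue: never guess.
      res.first->second.ambiguous = true;
    }
  }

  // Queue to publish for a row. Only queue-less HtoD copies are rewritten.
  std::uint32_t resolve(const GpuRow& row) const {
    if (!row.isHtoDMemcpy || row.queue != 0) {
      return row.queue;  // preserve existing values, including non-zero memcpy queues
    }
    auto it = map_.find(row.hipStream);
    if (it == map_.end() || it->second.ambiguous) {
      return 0;  // unknown or ambiguous stream: stay on stream 0
    }
    assert(it->second.queue != 0);  // learn() never stores queue 0
    return it->second.queue;
  }

 private:
  struct Entry {
    std::uint32_t queue;
    bool ambiguous;
  };
  std::unordered_map<std::uint64_t, Entry> map_;
};

// Two passes: learn from non-copy activity, then backfill queue-less HtoD rows.
void backfillHtoD(std::vector<GpuRow>& rows) {
  StreamQueueMap map;
  for (const auto& r : rows) {
    if (!r.isHtoDMemcpy) {
      map.learn(r.hipStream, r.queue);
    }
  }
  for (auto& r : rows) {
    r.queue = map.resolve(r);
  }
}
```

In this sketch, a kernel row on stream `0xA` with queue 3 lets a queue-less HtoD copy on `0xA` render on queue 3, while a stream seen with both queue 4 and queue 5 leaves its queue-less copies on stream 0. Doing the learning in a separate first pass keeps the result independent of the order in which rows arrive.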

Caveat [from Michael Wootton]
This is a Kineto rendering/backfill fix, not proof that an SDMA copy physically executed on the compute queue shown in the UI. Some memory copies may be serviced by SDMA without a queue id, and some may be serviced by a blit kernel. Mapping a queue-less copy onto the HIP stream's GPU queue is therefore virtual attribution: useful for making the trace readable and preserving stream-context overlap, but not a replacement for ROCm reporting an explicit copy queue or for a future dedicated pseudo-queue representation for queue-less copies.

Reviewed By: scotts

Differential Revision: D103952982

meta-codesync Bot commented May 11, 2026

@sanrise has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103952982.

sanrise commented May 11, 2026

@mwootton please take a look.

mwootton (Contributor) commented

That looks pretty straightforward. Just a little light remapping. If the test passes, it seems fine to me.

sanrise force-pushed the export-D103952982 branch from d587923 to 555140c on May 15, 2026 21:32
meta-codesync Bot closed this in 799b5f4 on May 15, 2026

meta-codesync Bot commented May 15, 2026

This pull request has been merged in 799b5f4.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request on May 22, 2026
Includes the following commits:

- ci: declare workflow-level `contents: read` on 5 workflows (pytorch/kineto#1404) 5902263
- Remove deprecated `REQUEST_TIMESTAMP` config key (pytorch/kineto#1409) 55883de
- Fix intermittent Mac CI failure from conda channel reset (pytorch/kineto#1407) ee27b5c
- Add nlohmann/json as a top-level third_party submodule (pytorch/kineto#1406) c044281
- Remove SIGUSR2 on-demand profiling path (pytorch/kineto#1408) 471ed38
- Fix ROCm HtoD memcpy stream attribution (pytorch/kineto#1398) 799b5f4
- Fix UST_LOGGER_MARK_COMPLETED build failure in manifold_trace_logger (pytorch/kineto#1389) 60967ce
- Remove `DefaultTimeConverterIsIdentity` test (pytorch/kineto#1401) 81d31cd
- Re-enable most PyTorch tests (pytorch/kineto#1403) 212f9a5
- Daily `arc lint --take CLANGFORMAT` (pytorch/kineto#1402) 6481fac
- Resolve CUPTI cbid names via cuptiGetCallbackName (pytorch/kineto#1400) e07e121
- XPUPTI: Fix ts=0 trace events on Windows (pytorch/kineto#1381) 4c8d01c
- Remove LIBKINETO_NO* compatibility shim (pytorch/kineto#1399) ea8bc18
- Upgrade Kineto to C++20 (pytorch/kineto#1397) 77e2b46
- Update the rocm api filtering (pytorch/kineto#1392) e0ac578
Pull Request resolved: #184784
Approved by: https://github.com/NicolasHug, https://github.com/malfet
pytorchmergebot pushed a commit to khushi-411/pytorch that referenced this pull request May 24, 2026