Fix ROCm HtoD memcpy stream attribution by sanrise · Pull Request #1398 · pytorch/kineto

sanrise · 2026-05-11T22:08:33Z

Summary:
Problem
AMD traces can render HtoD memcpy activity as if every copy happened on stream 0, even while the nearby GPU work is clearly using real non-zero streams. That makes the trace misleading for the main thing users are trying to understand: whether host-to-device copies overlap with active GPU work, and which stream context issued the copy.

Why
The ROCm async memory copy record can arrive without a useful queue id, so Kineto receives the copy as queue 0. At the same time, the HIP runtime stream context around the same work can still identify the real stream that issued the operation. Before this change, Kineto published the unusable queue 0 value directly, so the viewer collapsed those HtoD copies onto stream 0.

Fix
This change repairs only the safe case. The shared ROCm stream/queue helper learns an unambiguous mapping from each HIP stream to a non-zero ROCm queue, using correlated GPU activity, and then uses that mapping to backfill HtoD memcpy rows that would otherwise render as stream 0. Existing non-zero memcpy queues are preserved, and ambiguous mappings stay on stream 0 instead of guessing.
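The backfill rule described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code in this PR: the `GpuActivity` record, its field names, and both function names are hypothetical, and it assumes that any record carrying a non-zero queue may be used to learn the mapping.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GpuActivity:
    """Hypothetical, simplified activity record (not Kineto's real type)."""
    kind: str        # e.g. "kernel" or "memcpy_htod"
    hip_stream: int  # stream identified via the correlated HIP runtime call
    queue_id: int    # queue reported by ROCm; 0 means "no useful queue id"


def learn_stream_to_queue(activities):
    """Map each HIP stream to its single non-zero queue, dropping ambiguous streams."""
    mapping = {}
    ambiguous = set()
    for act in activities:
        if act.queue_id == 0:
            continue  # queue 0 carries no information to learn from
        seen = mapping.get(act.hip_stream)
        if seen is None:
            mapping[act.hip_stream] = act.queue_id
        elif seen != act.queue_id:
            ambiguous.add(act.hip_stream)  # two different queues: do not guess
    for stream in ambiguous:
        del mapping[stream]
    return mapping


def backfill_htod_queues(activities):
    """Return copies of the records with queue-less HtoD copies backfilled."""
    mapping = learn_stream_to_queue(activities)
    fixed = []
    for act in activities:
        queue = act.queue_id
        if act.kind == "memcpy_htod" and queue == 0:
            # Unknown or ambiguous streams fall back to 0 (unchanged behavior).
            queue = mapping.get(act.hip_stream, 0)
        fixed.append(replace(act, queue_id=queue))
    return fixed


trace = [
    GpuActivity("kernel", hip_stream=0xA, queue_id=3),
    GpuActivity("memcpy_htod", hip_stream=0xA, queue_id=0),  # backfilled to 3
    GpuActivity("kernel", hip_stream=0xB, queue_id=4),
    GpuActivity("kernel", hip_stream=0xB, queue_id=6),       # makes 0xB ambiguous
    GpuActivity("memcpy_htod", hip_stream=0xB, queue_id=0),  # stays on 0
    GpuActivity("memcpy_htod", hip_stream=0xC, queue_id=5),  # preserved as 5
]
fixed = backfill_htod_queues(trace)
print([a.queue_id for a in fixed])
```

The three streams in `trace` exercise the three cases named in the description: an unambiguous mapping is applied, an ambiguous one is left on stream 0, and an existing non-zero memcpy queue is left alone.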

Caveat [from Michael Wootton]
This is a Kineto rendering/backfill fix, not proof that an SDMA copy physically executed on the compute queue shown in the UI. Some memory copies may be serviced by SDMA without a queue id, and some may be serviced by a blit kernel. Mapping a queue-less copy onto the HIP stream's GPU queue is therefore virtual attribution: useful for making the trace readable and preserving stream-context overlap, but not a replacement for ROCm reporting an explicit copy queue or for a future dedicated pseudo-queue representation for queue-less copies.

Reviewed By: scotts

Differential Revision: D103952982

meta-codesync · 2026-05-11T22:08:41Z

@sanrise has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103952982.

sanrise · 2026-05-11T22:11:45Z

@mwootton please take a look.

mwootton · 2026-05-14T02:53:22Z

That looks pretty straightforward. Just a little light remapping. If the test passes, it seems fine to me.


meta-codesync · 2026-05-15T22:29:30Z

This pull request has been merged in 799b5f4.

Includes the following commits:
- ci: declare workflow-level `contents: read` on 5 workflows (pytorch/kineto#1404) 5902263
- Remove deprecated `REQUEST_TIMESTAMP` config key (pytorch/kineto#1409) 55883de
- Fix intermittent Mac CI failure from conda channel reset (pytorch/kineto#1407) ee27b5c
- Add nlohmann/json as a top-level third_party submodule (pytorch/kineto#1406) c044281
- Remove SIGUSR2 on-demand profiling path (pytorch/kineto#1408) 471ed38
- Fix ROCm HtoD memcpy stream attribution (pytorch/kineto#1398) 799b5f4
- Fix UST_LOGGER_MARK_COMPLETED build failure in manifold_trace_logger (pytorch/kineto#1389) 60967ce
- Remove `DefaultTimeConverterIsIdentity` test (pytorch/kineto#1401) 81d31cd
- Re-enable most PyTorch tests (pytorch/kineto#1403) 212f9a5
- Daily `arc lint --take CLANGFORMAT` (pytorch/kineto#1402) 6481fac
- Resolve CUPTI cbid names via cuptiGetCallbackName (pytorch/kineto#1400) e07e121
- XPUPTI: Fix ts=0 trace events on Windows (pytorch/kineto#1381) 4c8d01c
- Remove LIBKINETO_NO* compatibility shim (pytorch/kineto#1399) ea8bc18
- Upgrade Kineto to C++20 (pytorch/kineto#1397) 77e2b46
- Update the rocm api filtering (pytorch/kineto#1392) e0ac578

Pull Request resolved: #184784
Approved by: https://github.com/NicolasHug, https://github.com/malfet


pytorch-bot Bot added ciflow/rocm module: rocm labels May 11, 2026

meta-cla Bot added the cla signed label May 11, 2026

meta-codesync Bot added fb-exported meta-exported labels May 11, 2026

sanrise force-pushed the export-D103952982 branch from d587923 to 555140c Compare May 15, 2026 21:32

meta-codesync Bot closed this in 799b5f4 May 15, 2026

facebook-github-tools Bot added the Merged label May 15, 2026

This was referenced May 18, 2026

[Rocprofiler-sdk] All AMD memcpy events are being placed on stream 0 #1351

Closed

Update third_party/kineto submodule to 5902263 pytorch/pytorch#184784

Closed

This was referenced May 23, 2026

[review-only] PR1: fix(rocm): attribute per-stream GPU activities under rocprofiler-sdk ajassani/kineto#1

Draft

[review-only] PR2: feat(rocm): expose sync-API targets + inter-stream deps (CUPTI parity) ajassani/kineto#2

Draft

sanrise wants to merge 1 commit into pytorch:main from sanrise:export-D103952982
