Skip to content

[review-only] PR2: feat(rocm): expose sync-API targets + inter-stream deps (CUPTI parity)#2

Draft
ajassani wants to merge 8 commits into
pr1/stream-mapping-rebasedfrom
pr2/interstream-deps-rebased
Draft

[review-only] PR2: feat(rocm): expose sync-API targets + inter-stream deps (CUPTI parity)#2
ajassani wants to merge 8 commits into
pr1/stream-mapping-rebasedfrom
pr2/interstream-deps-rebased

Conversation

@ajassani
Copy link
Copy Markdown
Owner

@ajassani ajassani commented May 23, 2026

Fork-only PR for Adeem's review. Do not merge — this PR exists so we can
discuss inline before sending to pytorch/kineto:main.
Base of this PR is pr1/stream-mapping-rebased (PR1) so the diff shows
only PR2's content.

Heads-up before review

  • Upstream PR #1398 (Darshan
    Sanghani, Meta, merged 2026-05-15) introduces RocmStreamQueue.h — a
    helper for ROCm stream/queue mapping. PR2 does not depend on it
    (PR2 is about sync-API metadata, not stream attribution), but reviewers
    may ask whether parts of PR2 should plug into the new helper. Worth a
    read before going upstream.
  • 2026-05-23 update: the a0fc873 "honour legacy LIBKINETO_NO*"
    shim commit has been dropped from this branch — upstream PR
    #1399 removed the very
    shim it re-added, so it was DOA. To keep our local dev container
    building, we'll need a separate strategy (env var
    KINETO_BACKEND=rocm, local Dependencies.cmake patch outside the
    submodule, or bumping PyTorch past Switch PyTorch's kineto build glue from LIBKINETO_NO* to KINETO_BACKEND pytorch/pytorch#183388). This PR no
    longer carries the shim.
  • One build-cleanup commit remains: e097ad6 "suppress -Wpedantic on
    ROCm builds against system rocprofiler-sdk". This is real (ROCm 7.x
    headers trip pedantic) and should survive upstream review on its own
    merits, possibly split into a separate prep-PR if reviewers prefer.
  • Tests pass on both backends as of the L1/L2/L3 validation runs
    (see project-docs/kineto-pr-series-plan.md).

Original PR body

feat(rocm): expose sync-API targets + inter-stream dependencies (CUPTI parity)

Problem

ROCm Kineto traces show synchronization API rows (hipEventRecord,
hipStreamWaitEvent, hipEventSynchronize, hipStreamSynchronize,
hipDeviceSynchronize) as opaque CPU-side timing rows with no metadata.

Going from "the CPU thread sat blocked for 5 ms" to "the CPU thread was
blocked waiting on event E that was recorded by stream A in this kernel
launch" is impossible. There is no resolved producer information, no
event handle, no stream pointer in the trace.

CUPTI emits exactly this metadata via CUPTI_ACTIVITY_KIND_SYNCHRONIZATION
records (waitOnStreamId, waitOnCudaEventRecordCorrId,
waitOnCudaEventId, etc.). This PR brings the ROCm rocprofiler-sdk backend
to parity.

Design

CUPTI parity is the trump card

Every choice in this PR cites the equivalent CUPTI field. The intent is that
a reviewer can open CuptiActivity.h and see the same shape:

ROCm Kineto field CUPTI equivalent
hip_event CUpti_ActivityCudaEvent.eventId
hip_stream CUpti_ActivityCudaEvent.streamId
sync_type CUpti_ActivitySynchronization2.type
wait_on_stream CUpti_ActivitySynchronization2.streamId
wait_on_hip_event_id wait_on_cuda_event_id
wait_on_hip_event_record_corr_id wait_on_cuda_event_record_corr_id

Resolve eagerly, preserve raw fields

For every hipEventRecord we record {stream, correlation_id} in
g_eventMap[hipEvent_t]. For every wait/sync API we look it up and emit a
resolved producer alongside the raw event handle. If lookup fails (record
delivered out of order, or producer was never traced) we still emit the
event/stream fields so the trace viewer can attempt a partial dependency.

Vector-backed event map (handles hipEvent_t reuse)

A hipEvent_t handle can be re-recorded many times in one trace. The map
stores a sorted vector of {correlation_id, stream} entries per handle,
inserted with std::lower_bound and looked up with std::upper_bound to
find the most recent record up to a given correlation_id. Mirrors the
upstream NVIDIA-side fix in pytorch#1339.

Tests

libkineto/test/RocmActivityProfilerTest.cpp (+423 LOC) adds tests
mirroring CUPTI's wait-event coverage:

Test What it covers
InterStreamDependencyTest End-to-end hipEventRecord + hipStreamWaitEvent resolution
StreamWaitEventFutureCorrelation Out-of-order callback delivery; the wait sees the future record
EventMapClearedOnReset clearEventMap() is invoked on onResetTraceData
EventSynchronizeResolvesProducer hipEventSynchronize carries resolved producer metadata
UnresolvedWaitStillEmitsEventId Missing producer → still emit raw wait_on_hip_event_id

Helper methods (recordEvent, resolveWait, clearEventMap) are extracted
as static class methods in RocprofLogger so tests exercise production
lookup logic, not hardcoded data.

Trace evidence

Captured under ground-control/cache/kineto-pr-validation/ using a
deterministic 2-stream + hipEventRecord/hipStreamWaitEvent workload:

                          pure-stock   PR1-only    PR1+PR2
hipEventRecord args:      —            —           hip_event + hip_stream (8/8)
hipStreamWaitEvent args:  —            —           wait_on_stream + wait_on_hip_event_record_corr_id (8/8 resolved)
hipDeviceSynchronize:     —            —           sync_type + hip_stream
total events:             265          265         265   (no spurious events)

3-way comparison run with diff_baseline_vs_patched.py. PASS.

Backend parity

PR3 (out of scope here, separate review) mirrors this design on the
roctracer path so existing PyTorch ROCm wheel users see the same fields
once the wheel ships with this Kineto pin.

Out of scope

  • Roctracer port (separate PR3).
  • "Sync events on the GPU stream row" GPU-side visualization (would need a
    new CUDA_SYNC activity kind for ROCm; ROCm sync APIs emit as
    CUDA_RUNTIME today, so there is no GPU-row collision to dodge).

ajassani and others added 8 commits May 22, 2026 16:39
Capture hipEventRecord and hipStreamWaitEvent calls during profiling to
expose inter-stream synchronization dependencies in Chrome/Perfetto
traces.  This data enables downstream tools to reconstruct the
communication-compute overlap and causal ordering between GPU streams
in multi-stream workloads (e.g. DDP, pipeline parallelism).

New activity types:
  - rocprofEventRecordRow: records which event was placed on which stream
  - rocprofSyncRow: records sync operations with source stream/event info

The metadata is emitted as JSON fields (hip_event, hip_stream,
sync_type, wait_on_stream, wait_on_hip_event_record_corr_id) that can
be parsed by trace analysis tools to build a dependency graph.

hipEventRecord is removed from the callback omit list so that event
record calls are captured by the profiler.

Test: InterStreamDependencyTest verifies that event record and sync
activities are emitted with correct dependency metadata in the Chrome
trace JSON output.
Bug A: unresolved hipStreamWaitEvent events dropped the event handle
entirely, making it impossible to reconstruct the dependency.
Bug B: resolved events emitted wait_on_stream (producer) but lost
hip_stream (consumer).

Now all sync events always emit hip_stream (consumer) and hip_event
when available, plus the wait_on_* fields when the lookup succeeds.
The consumer stream in hipStreamWaitEvent can be the default stream
(null pointer), which is still meaningful for dependency resolution.
Emit it unconditionally so the post-processor can always find both
producer and consumer.
The default/compute stream in HIP is represented as nullptr (0x0).
When hipEventRecord records an event on the default stream, g_eventMap
stores srcStream=nullptr. The callback correctly retrieves this, but
metadataJson() was checking raw().srcStream as a null-pointer guard,
which silently dropped the resolution for all default-stream events.

This caused 50% of hipStreamWaitEvent dependencies (the comp→comm
direction in DDP) to appear unresolved in the trace despite the
g_eventMap lookup succeeding at callback time.

Fix: check srcCorrId (non-zero when lookup succeeded) instead of
srcStream (which is legitimately nullptr for the default stream).
HIP event handles are routinely reused: applications call hipEventRecord
on the same hipEvent_t many times across a trace. The previous
single-entry map (hipEvent_t -> {stream, corrId}) collapsed all those
records onto the most recent one, which misattributes the producer of a
hipStreamWaitEvent whose own callback fires after a subsequent record on
the same event.

Switch g_eventMap to vector<EventMapEntry>, keep the per-handle vector
sorted by correlationId on insert via std::lower_bound, and resolve
hipStreamWaitEvent by std::upper_bound + prev to pick the most recent
record whose correlationId strictly precedes the wait's. Clear the map
in RocmActivityProfiler::onResetTraceData so a session boundary cannot
leak stale entries into the next trace.

Mirrors the CUPTI design in CuptiActivityProfiler.cpp:waitEventMap()
+ updateWaitEventMap + getWaitEventInfo for CUPTI parity.

Co-authored-by: Cursor <cursoragent@cursor.com>
…_event_id

Two CUPTI-parity gaps in the rocprofiler-sdk sync metadata:

1. Only `hipStreamWaitEvent` looked up `g_eventMap` to attribute the
   producer record. `hipEventSynchronize`, which on the CUPTI side falls
   into the same `isEventSync()` bucket (STREAM_WAIT_EVENT and
   EVENT_SYNCHRONIZE both consult `waitEventMap`), got no producer
   attribution at all. Refactor the lookup out of the switch so both
   sync types share it.

2. The emitted JSON lacked a `wait_on_hip_event_id` field. CUPTI's
   `eventSyncInfo` always emits `wait_on_cuda_event_id` (the event
   handle itself) regardless of whether a producer record was found, so
   tooling can correlate waits even when the corresponding record fell
   outside the trace window. Mirror that: always emit
   `wait_on_hip_event_id` for the two event-sync types, and add the
   producer pair (`wait_on_stream`, `wait_on_hip_event_record_corr_id`)
   only when `g_eventMap` resolved one.

Field-name map vs CUPTI is now:
  wait_on_stream                    <=> wait_on_stream
  wait_on_hip_event_record_corr_id  <=> wait_on_cuda_event_record_corr_id
  wait_on_hip_event_id              <=> wait_on_cuda_event_id

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds four new tests that mirror CUPTI's wait-event coverage in
CuptiActivityProfilerTest.cpp:

  StreamWaitEventFutureCorrelation
    Two hipEventRecord callbacks on the same hipEvent_t (corr 100 then
    200) and a hipStreamWaitEvent at corr 101 in between. Without the
    vector-backed g_eventMap, the wait would be attributed to the more
    recent (corr 200) record; with it, the wait correctly resolves to
    corr 100. Exercises lower_bound insert + upper_bound lookup.

  EventMapClearedOnReset
    Records an event in session 1, drives RocmActivityProfiler::reset(),
    then issues a wait in session 2 with no prior record. The wait must
    NOT pick up session 1's stale entry.

  EventSynchronizeResolvesProducer
    Records an event and then calls hipEventSynchronize. Verifies that
    the sync gets producer attribution (previously only
    hipStreamWaitEvent did) and that wait_on_hip_event_id is emitted.

  UnresolvedWaitStillEmitsEventId
    Wait fires on an event whose record predated the trace window.
    Verifies wait_on_hip_event_id is still emitted (tooling can match
    waits against records observed elsewhere) while
    wait_on_hip_event_record_corr_id / wait_on_stream are omitted.

To drive the production lookup path from tests, the inline g_eventMap
operations in the api_callback were extracted into three static
helpers on RocprofLogger:

  recordEvent(event, stream, corrId)
  resolveWait(event, queryCorrId, &outStream, &outCorrId)
  clearEventMap()

MockRocLogger now calls recordEvent() from addEventRecordActivity() and
exposes addSyncActivityResolvingFromMap() so test traces exercise the
exact same lookup the real callback does.

Co-authored-by: Cursor <cursoragent@cursor.com>
The rocprofiler-sdk and HSA headers shipped with ROCm 7.x use GCC
extensions (anonymous structs, flexible array members) that violate
strict ISO C++. When the consumer build propagates `-Wpedantic` to
kineto - PyTorch's cmake/Dependencies.cmake does this for everything
under caffe2_* - compilation of RocprofLogger.cpp, RocprofActivityApi.cpp
and init.cpp fails with errors like:

  /opt/rocm-7.2.3/include/rocprofiler-sdk/fwd.h:757:9:
    error: ISO C++ prohibits anonymous structs [-Werror=pedantic]

These headers are owned by ROCm, not by us, so the fix lives on the
consumer side. Add `-Wno-pedantic` to KINETO_COMPILE_OPTIONS for the
rocm backend only; this turns off pedantic warnings for the kineto
translation units that include the system headers, leaving the rest of
the build untouched.

Limiting the suppression to GCC/Clang avoids touching MSVC builds.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ajassani ajassani force-pushed the pr2/interstream-deps-rebased branch from d862fbf to e097ad6 Compare May 23, 2026 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant