[review-only] PR2: feat(rocm): expose sync-API targets + inter-stream deps (CUPTI parity)#2
Draft
ajassani wants to merge 8 commits into
Draft
Conversation
Capture hipEventRecord and hipStreamWaitEvent calls during profiling to expose inter-stream synchronization dependencies in Chrome/Perfetto traces. This data enables downstream tools to reconstruct the communication-compute overlap and causal ordering between GPU streams in multi-stream workloads (e.g. DDP, pipeline parallelism). New activity types: - rocprofEventRecordRow: records which event was placed on which stream - rocprofSyncRow: records sync operations with source stream/event info The metadata is emitted as JSON fields (hip_event, hip_stream, sync_type, wait_on_stream, wait_on_hip_event_record_corr_id) that can be parsed by trace analysis tools to build a dependency graph. hipEventRecord is removed from the callback omit list so that event record calls are captured by the profiler. Test: InterStreamDependencyTest verifies that event record and sync activities are emitted with correct dependency metadata in the Chrome trace JSON output.
Bug A: unresolved hipStreamWaitEvent events dropped the event handle entirely, making it impossible to reconstruct the dependency. Bug B: resolved events emitted wait_on_stream (producer) but lost hip_stream (consumer). Now all sync events always emit hip_stream (consumer) and hip_event when available, plus the wait_on_* fields when the lookup succeeds.
The consumer stream in hipStreamWaitEvent can be the default stream (null pointer), which is still meaningful for dependency resolution. Emit it unconditionally so the post-processor can always find both producer and consumer.
The default/compute stream in HIP is represented as nullptr (0x0). When hipEventRecord records an event on the default stream, g_eventMap stores srcStream=nullptr. The callback correctly retrieves this, but metadataJson() was checking raw().srcStream as a null-pointer guard, which silently dropped the resolution for all default-stream events. This caused 50% of hipStreamWaitEvent dependencies (the comp→comm direction in DDP) to appear unresolved in the trace despite the g_eventMap lookup succeeding at callback time. Fix: check srcCorrId (non-zero when lookup succeeded) instead of srcStream (which is legitimately nullptr for the default stream).
HIP event handles are routinely reused: applications call hipEventRecord
on the same hipEvent_t many times across a trace. The previous
single-entry map (hipEvent_t -> {stream, corrId}) collapsed all those
records onto the most recent one, which misattributes the producer of a
hipStreamWaitEvent whose own callback fires after a subsequent record on
the same event.
Switch g_eventMap to vector<EventMapEntry>, keep the per-handle vector
sorted by correlationId on insert via std::lower_bound, and resolve
hipStreamWaitEvent by std::upper_bound + prev to pick the most recent
record whose correlationId strictly precedes the wait's. Clear the map
in RocmActivityProfiler::onResetTraceData so a session boundary cannot
leak stale entries into the next trace.
Mirrors the CUPTI design in CuptiActivityProfiler.cpp:waitEventMap()
+ updateWaitEventMap + getWaitEventInfo for CUPTI parity.
Co-authored-by: Cursor <cursoragent@cursor.com>
…_event_id Two CUPTI-parity gaps in the rocprofiler-sdk sync metadata: 1. Only `hipStreamWaitEvent` looked up `g_eventMap` to attribute the producer record. `hipEventSynchronize`, which on the CUPTI side falls into the same `isEventSync()` bucket (STREAM_WAIT_EVENT and EVENT_SYNCHRONIZE both consult `waitEventMap`), got no producer attribution at all. Refactor the lookup out of the switch so both sync types share it. 2. The emitted JSON lacked a `wait_on_hip_event_id` field. CUPTI's `eventSyncInfo` always emits `wait_on_cuda_event_id` (the event handle itself) regardless of whether a producer record was found, so tooling can correlate waits even when the corresponding record fell outside the trace window. Mirror that: always emit `wait_on_hip_event_id` for the two event-sync types, and add the producer pair (`wait_on_stream`, `wait_on_hip_event_record_corr_id`) only when `g_eventMap` resolved one. Field-name map vs CUPTI is now: wait_on_stream <=> wait_on_stream wait_on_hip_event_record_corr_id <=> wait_on_cuda_event_record_corr_id wait_on_hip_event_id <=> wait_on_cuda_event_id Co-authored-by: Cursor <cursoragent@cursor.com>
Adds four new tests that mirror CUPTI's wait-event coverage in
CuptiActivityProfilerTest.cpp:
StreamWaitEventFutureCorrelation
Two hipEventRecord callbacks on the same hipEvent_t (corr 100 then
200) and a hipStreamWaitEvent at corr 101 in between. Without the
vector-backed g_eventMap, the wait would be attributed to the more
recent (corr 200) record; with it, the wait correctly resolves to
corr 100. Exercises lower_bound insert + upper_bound lookup.
EventMapClearedOnReset
Records an event in session 1, drives RocmActivityProfiler::reset(),
then issues a wait in session 2 with no prior record. The wait must
NOT pick up session 1's stale entry.
EventSynchronizeResolvesProducer
Records an event and then calls hipEventSynchronize. Verifies that
the sync gets producer attribution (previously only
hipStreamWaitEvent did) and that wait_on_hip_event_id is emitted.
UnresolvedWaitStillEmitsEventId
Wait fires on an event whose record predated the trace window.
Verifies wait_on_hip_event_id is still emitted (tooling can match
waits against records observed elsewhere) while
wait_on_hip_event_record_corr_id / wait_on_stream are omitted.
To drive the production lookup path from tests, the inline g_eventMap
operations in the api_callback were extracted into three static
helpers on RocprofLogger:
recordEvent(event, stream, corrId)
resolveWait(event, queryCorrId, &outStream, &outCorrId)
clearEventMap()
MockRocLogger now calls recordEvent() from addEventRecordActivity() and
exposes addSyncActivityResolvingFromMap() so test traces exercise the
exact same lookup the real callback does.
Co-authored-by: Cursor <cursoragent@cursor.com>
The rocprofiler-sdk and HSA headers shipped with ROCm 7.x use GCC
extensions (anonymous structs, flexible array members) that violate
strict ISO C++. When the consumer build propagates `-Wpedantic` to
kineto - PyTorch's cmake/Dependencies.cmake does this for everything
under caffe2_* - compilation of RocprofLogger.cpp, RocprofActivityApi.cpp
and init.cpp fails with errors like:
/opt/rocm-7.2.3/include/rocprofiler-sdk/fwd.h:757:9:
error: ISO C++ prohibits anonymous structs [-Werror=pedantic]
These headers are owned by ROCm, not by us, so the fix lives on the
consumer side. Add `-Wno-pedantic` to KINETO_COMPILE_OPTIONS for the
rocm backend only; this turns off pedantic warnings for the kineto
translation units that include the system headers, leaving the rest of
the build untouched.
Limiting the suppression to GCC/Clang avoids touching MSVC builds.
Co-authored-by: Cursor <cursoragent@cursor.com>
d862fbf to
e097ad6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Heads-up before review
Sanghani, Meta, merged 2026-05-15) introduces
RocmStreamQueue.h— ahelper for ROCm stream/queue mapping. PR2 does not depend on it
(PR2 is about sync-API metadata, not stream attribution), but reviewers
may ask whether parts of PR2 should plug into the new helper. Worth a
read before going upstream.
a0fc873"honour legacyLIBKINETO_NO*"shim commit has been dropped from this branch — upstream PR
#1399 removed the very
shim it re-added, so it was DOA. To keep our local dev container
building, we'll need a separate strategy (env var
KINETO_BACKEND=rocm, local Dependencies.cmake patch outside thesubmodule, or bumping PyTorch past Switch PyTorch's kineto build glue from LIBKINETO_NO* to KINETO_BACKEND pytorch/pytorch#183388). This PR no
longer carries the shim.
e097ad6"suppress-WpedanticonROCm builds against system rocprofiler-sdk". This is real (ROCm 7.x
headers trip pedantic) and should survive upstream review on its own
merits, possibly split into a separate prep-PR if reviewers prefer.
(see
project-docs/kineto-pr-series-plan.md).Original PR body
feat(rocm): expose sync-API targets + inter-stream dependencies (CUPTI parity)
Problem
ROCm Kineto traces show synchronization API rows (
hipEventRecord,hipStreamWaitEvent,hipEventSynchronize,hipStreamSynchronize,hipDeviceSynchronize) as opaque CPU-side timing rows with no metadata.Going from "the CPU thread sat blocked for 5 ms" to "the CPU thread was
blocked waiting on event E that was recorded by stream A in this kernel
launch" is impossible. There is no resolved producer information, no
event handle, no stream pointer in the trace.
CUPTI emits exactly this metadata via
CUPTI_ACTIVITY_KIND_SYNCHRONIZATIONrecords (
waitOnStreamId,waitOnCudaEventRecordCorrId,waitOnCudaEventId, etc.). This PR brings the ROCm rocprofiler-sdk backendto parity.
Design
CUPTI parity is the trump card
Every choice in this PR cites the equivalent CUPTI field. The intent is that
a reviewer can open
CuptiActivity.hand see the same shape:hip_eventCUpti_ActivityCudaEvent.eventIdhip_streamCUpti_ActivityCudaEvent.streamIdsync_typeCUpti_ActivitySynchronization2.typewait_on_streamCUpti_ActivitySynchronization2.streamIdwait_on_hip_event_idwait_on_cuda_event_idwait_on_hip_event_record_corr_idwait_on_cuda_event_record_corr_idResolve eagerly, preserve raw fields
For every
hipEventRecordwe record{stream, correlation_id}ing_eventMap[hipEvent_t]. For every wait/sync API we look it up and emit aresolved producer alongside the raw event handle. If lookup fails (record
delivered out of order, or producer was never traced) we still emit the
event/stream fields so the trace viewer can attempt a partial dependency.
Vector-backed event map (handles
hipEvent_treuse)A
hipEvent_thandle can be re-recorded many times in one trace. The mapstores a sorted vector of
{correlation_id, stream}entries per handle,inserted with
std::lower_boundand looked up withstd::upper_boundtofind the most recent record up to a given correlation_id. Mirrors the
upstream NVIDIA-side fix in pytorch#1339.
Tests
libkineto/test/RocmActivityProfilerTest.cpp(+423 LOC) adds testsmirroring CUPTI's wait-event coverage:
InterStreamDependencyTesthipEventRecord+hipStreamWaitEventresolutionStreamWaitEventFutureCorrelationEventMapClearedOnResetclearEventMap()is invoked ononResetTraceDataEventSynchronizeResolvesProducerhipEventSynchronizecarries resolved producer metadataUnresolvedWaitStillEmitsEventIdwait_on_hip_event_idHelper methods (
recordEvent,resolveWait,clearEventMap) are extractedas static class methods in
RocprofLoggerso tests exercise productionlookup logic, not hardcoded data.
Trace evidence
Captured under
ground-control/cache/kineto-pr-validation/using adeterministic 2-stream +
hipEventRecord/hipStreamWaitEventworkload:3-way comparison run with
diff_baseline_vs_patched.py. PASS.Backend parity
PR3 (out of scope here, separate review) mirrors this design on the
roctracer path so existing PyTorch ROCm wheel users see the same fields
once the wheel ships with this Kineto pin.
Out of scope
new
CUDA_SYNCactivity kind for ROCm; ROCm sync APIs emit asCUDA_RUNTIMEtoday, so there is no GPU-row collision to dodge).