[xpupti]: emit GPU_USER_ANNOTATION for user record_function ranges #1393
SlawomirLaba wants to merge 1 commit into
cc: @gujinghui, @chuanqi129
fd1f42c to 11e3666
11e3666 to 027183c
The XPU PTI plugin already wires user correlation IDs into PTI via
ptiViewPushExternalCorrelationId / PTI_VIEW_EXTERNAL_KIND_CUSTOM_1 and
populates userCorrelationMap_ in handleCorrelationActivity(), but it
never produced any GPU_USER_ANNOTATION events from that information,
so user record_function() ranges did not appear on the device timeline.
After a GPU activity (kernel, memcpy, memset) is logged, look up its
correlation id in userCorrelationMap_ and resolve the originating CPU
activity via the linked-activity callback. The first resolved hit on a
given (device, stream, user_corr_id) emplaces a synthesized
GPU_USER_ANNOTATION GenericTraceActivity; subsequent hits on the same
key widen its [startTime, endTime] so the annotation spans the union
of all GPU activities on that stream that share the user correlation
id, matching the behavior of GpuUserEventMap::insertOrExtendEvent in
the generic profiler. The accumulated annotations are logged once at
the end of processTrace and the per-session map is cleared between
iterations.
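
The insert-or-extend behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `AnnotationSketch`, `AnnotationKey`, and `AnnotationMapSketch` are hypothetical names, and the real plugin stores `GenericTraceActivity` objects rather than this simplified struct.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for a synthesized GPU_USER_ANNOTATION activity.
struct AnnotationSketch {
  std::string name;
  int64_t startTime;
  int64_t endTime;
};

// Assumed key layout: (device id, stream id, user correlation id).
using AnnotationKey = std::tuple<int64_t, int64_t, int64_t>;

class AnnotationMapSketch {
 public:
  // The first hit on a key creates the annotation; later hits on the same
  // key widen [startTime, endTime] to cover the new GPU activity as well.
  void insertOrExtend(
      const AnnotationKey& key,
      const std::string& name,
      int64_t startTime,
      int64_t endTime) {
    assert(startTime <= endTime);
    auto [it, inserted] =
        annotations_.try_emplace(key, AnnotationSketch{name, startTime, endTime});
    if (!inserted) {
      it->second.startTime = std::min(it->second.startTime, startTime);
      it->second.endTime = std::max(it->second.endTime, endTime);
    }
  }

  const std::map<AnnotationKey, AnnotationSketch>& annotations() const {
    return annotations_;
  }

  // Called between profiling iterations so ranges do not leak across sessions.
  void clear() {
    annotations_.clear();
  }

 private:
  std::map<AnnotationKey, AnnotationSketch> annotations_;
};
```

Keying on the stream as well as the user correlation id is what yields one annotation per participating stream instead of a single range spanning unrelated streams.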
The lookup is gated on a successful callback resolution rather than a
separate dedup set, so a transient nullptr return (CPU side not yet
visible when the first GPU activity is processed) is retried on later
GPU activities for the same range.
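
A rough sketch of why gating on callback resolution gives retry behavior for free is shown below. All names here (`RetryingAnnotatorSketch`, `CpuActivitySketch`, `LinkedActivityCallbackSketch`, `onGpuActivity`) are hypothetical and simplified; the sketch only assumes a map from GPU correlation id to user correlation id and a callback that may return nullptr.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the CPU-side activity behind a user range.
struct CpuActivitySketch {
  std::string name;
};

// Assumed callback shape: user correlation id -> linked CPU activity or nullptr.
using LinkedActivityCallbackSketch =
    std::function<const CpuActivitySketch*(int64_t)>;

struct RangeSketch {
  std::string name;
  int64_t startTime;
  int64_t endTime;
};

class RetryingAnnotatorSketch {
 public:
  RetryingAnnotatorSketch(
      std::unordered_map<int64_t, int64_t> userCorrelationMap,
      LinkedActivityCallbackSketch callback)
      : userCorrelationMap_(std::move(userCorrelationMap)),
        callback_(std::move(callback)) {}

  // Returns true if this GPU activity contributed to an annotation.
  bool onGpuActivity(
      int64_t correlationId,
      int64_t streamId,
      int64_t startTime,
      int64_t endTime) {
    auto userIt = userCorrelationMap_.find(correlationId);
    if (userIt == userCorrelationMap_.end()) {
      return false; // not inside any user range
    }
    const CpuActivitySketch* linked = callback_(userIt->second);
    if (linked == nullptr) {
      // Nothing is recorded on failure, so the next GPU activity in the
      // same user range simply performs the lookup again.
      return false;
    }
    auto key = std::make_pair(streamId, userIt->second);
    auto [it, inserted] =
        ranges_.try_emplace(key, RangeSketch{linked->name, startTime, endTime});
    if (!inserted) {
      it->second.startTime = std::min(it->second.startTime, startTime);
      it->second.endTime = std::max(it->second.endTime, endTime);
    }
    return true;
  }

  const std::map<std::pair<int64_t, int64_t>, RangeSketch>& ranges() const {
    return ranges_;
  }

 private:
  std::unordered_map<int64_t, int64_t> userCorrelationMap_;
  LinkedActivityCallbackSketch callback_;
  std::map<std::pair<int64_t, int64_t>, RangeSketch> ranges_;
};
```

A separate "already seen" set would have marked the key as handled on the first, failed lookup; leaving no state behind on failure avoids that.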
Only emit the synthesized event when the caller requested
ActivityType::GPU_USER_ANNOTATION, to avoid producing extra events for
clients that do not opt in.
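
The opt-in gate might look something like the sketch below. `ActivityTypeSketch` and `annotationsToLog` are hypothetical names used for illustration; the real check would be made against the set of `ActivityType` values the caller passed to the profiler.

```cpp
#include <cassert> // used by the checks that exercise this sketch
#include <set>
#include <string>
#include <vector>

// Hypothetical subset of activity types, for illustration only.
enum class ActivityTypeSketch {
  CONCURRENT_KERNEL,
  GPU_MEMCPY,
  GPU_USER_ANNOTATION,
};

// Returns the annotations to log: all pending ones if the caller opted in
// to GPU_USER_ANNOTATION, otherwise none.
std::vector<std::string> annotationsToLog(
    const std::set<ActivityTypeSketch>& requested,
    const std::vector<std::string>& pending) {
  if (requested.count(ActivityTypeSketch::GPU_USER_ANNOTATION) == 0) {
    return {};
  }
  return pending;
}
```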
Tests
Extend RunProfilerTest with optional parameters (userCorrelationId,
linkedCpuActivity, linkedActivityCallback) so existing test scaffolding
can drive a session that pushes/pops a user correlation id around the
XPU workload and resolves it back to a CPU-side activity through the
linked-activity callback. Existing tests are unaffected (defaults are
0 / nullptr).
Add XpuptiProfilerTest.GpuUserAnnotation, which enables
GPU_USER_ANNOTATION alongside the existing GPU/runtime activities,
runs the XPU compute helper inside a user correlation range, and
asserts that exactly one synthesized "user_function" annotation is
emitted per participating GPU stream (memcpy stream and kernel
stream).
Add XpuptiProfilerTest.GpuUserAnnotationLinkedActivityRetry, which
uses a callback that returns nullptr on the first lookup and the
linked CPU activity thereafter, and verifies that the plugin still
emits one annotation per stream -- pinning down the
retry-until-resolved contract on the linked-activity callback.
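
A callback with the "nullptr first, resolved afterwards" behavior used by the retry test could be built along these lines. This is only a sketch under assumed types: `makeFailOnceCallback`, `CpuActivitySketch`, and the callback signature are hypothetical and may differ from the test code in this PR.

```cpp
#include <cassert> // used by the checks that exercise this sketch
#include <cstdint>
#include <functional>
#include <memory>

// Hypothetical stand-in for the linked CPU-side activity.
struct CpuActivitySketch {
  int64_t correlationId;
};

using LinkedActivityCallbackSketch =
    std::function<const CpuActivitySketch*(int64_t)>;

// Builds a callback that returns nullptr on its first invocation and the
// given activity on every later one. The call counter lives in a shared_ptr
// so that copies of the std::function share the same state.
inline LinkedActivityCallbackSketch makeFailOnceCallback(
    const CpuActivitySketch* linked) {
  auto calls = std::make_shared<int>(0);
  return [calls, linked](int64_t) -> const CpuActivitySketch* {
    return (*calls)++ == 0 ? nullptr : linked;
  };
}
```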