[xpupti]: emit GPU_USER_ANNOTATION for user record_function ranges #1393

Open
SlawomirLaba wants to merge 1 commit into pytorch:main from SlawomirLaba:dev/slabax/gpu_user_annotation

SlawomirLaba commented May 8, 2026

The XPU PTI plugin already wires user correlation IDs into PTI via
ptiViewPushExternalCorrelationId / PTI_VIEW_EXTERNAL_KIND_CUSTOM_1 and
populates userCorrelationMap_ in handleCorrelationActivity(), but it
never produced any GPU_USER_ANNOTATION events from that information,
so user record_function() ranges did not appear on the device timeline.

After a GPU activity (kernel, memcpy, memset) is logged, look up its
correlation id in userCorrelationMap_ and resolve the originating CPU
activity via the linked-activity callback. The first resolved hit on a
given (device, stream, user_corr_id) emplaces a synthesized
GPU_USER_ANNOTATION GenericTraceActivity; subsequent hits on the same
key widen its [startTime, endTime] so the annotation spans the union
of all GPU activities on that stream that share the user correlation
id, matching the behavior of GpuUserEventMap::insertOrExtendEvent in
the generic profiler. The accumulated annotations are logged once at
the end of processTrace and the per-session map is cleared between
iterations.
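
The insert-or-extend step described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `Annotation`, `AnnotationKey`, and `AnnotationMap` are hypothetical stand-ins for the plugin's real types, and timestamps are plain integers.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical stand-in for a synthesized annotation (not the Kineto type).
struct Annotation {
  std::string name;
  int64_t startTime;
  int64_t endTime;
};

// Key: (device, stream, user correlation id).
using AnnotationKey = std::tuple<int64_t, int64_t, uint64_t>;

class AnnotationMap {
 public:
  // The first hit on a key creates the annotation; later hits widen
  // [startTime, endTime] to the union of all ranges seen for that key.
  void insertOrExtend(const AnnotationKey& key, const std::string& name,
                      int64_t start, int64_t end) {
    auto it = events_.find(key);
    if (it == events_.end()) {
      events_.emplace(key, Annotation{name, start, end});
      return;
    }
    it->second.startTime = std::min(it->second.startTime, start);
    it->second.endTime = std::max(it->second.endTime, end);
  }

  const std::map<AnnotationKey, Annotation>& events() const { return events_; }

  // Called between iterations so ranges do not leak across sessions.
  void clear() { events_.clear(); }

 private:
  std::map<AnnotationKey, Annotation> events_;
};
```

Because the stream is part of the key, a memcpy and a kernel issued on different streams under the same user range yield two separate annotations in this sketch, one per stream.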

The lookup is gated on a successful callback resolution rather than a
separate dedup set, so a transient nullptr return (CPU side not yet
visible when the first GPU activity is processed) is retried on later
GPU activities for the same range.
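
A rough sketch of that gating, under the assumption that the callback takes a correlation id and returns a pointer or nullptr. `CpuActivity`, `Span`, `LinkedActivityCallback`, and `annotateIfResolved` are hypothetical names for illustration only; the real callback signature may differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-ins; not the plugin's real types.
struct CpuActivity {
  std::string name;
};
struct Span {
  std::string name;
  int64_t start;
  int64_t end;
};
using LinkedActivityCallback = std::function<const CpuActivity*(uint64_t)>;

// Returns true if an annotation was created or widened. When the callback
// cannot resolve the CPU activity yet, nothing is recorded for this id, so
// the next GPU activity with the same id simply tries the lookup again.
bool annotateIfResolved(std::map<uint64_t, Span>& spans, uint64_t userCorrId,
                        int64_t start, int64_t end,
                        const LinkedActivityCallback& resolve) {
  const CpuActivity* cpu = resolve ? resolve(userCorrId) : nullptr;
  if (cpu == nullptr) {
    return false;  // no dedup entry is written, so a retry stays possible
  }
  auto it = spans.find(userCorrId);
  if (it == spans.end()) {
    spans.emplace(userCorrId, Span{cpu->name, start, end});
  } else {
    it->second.start = std::min(it->second.start, start);
    it->second.end = std::max(it->second.end, end);
  }
  return true;
}
```

One consequence in this sketch is that a GPU activity whose lookup failed does not contribute to the span; only activities processed after the first successful resolution do.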

Only emit the synthesized event when the caller requested
ActivityType::GPU_USER_ANNOTATION, to avoid producing extra events for
clients that do not opt in.
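
The opt-in check amounts to something like the sketch below. `ActivityKind` and `flushAnnotations` are hypothetical names used only for illustration; the plugin itself checks the real ActivityType::GPU_USER_ANNOTATION value.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical local enum standing in for the profiler's activity types.
enum class ActivityKind { Kernel, Memcpy, Memset, GpuUserAnnotation };

// Returns the annotations to log: all pending ones if the client asked for
// user annotations, otherwise none.
std::vector<std::string> flushAnnotations(
    const std::set<ActivityKind>& requested,
    const std::vector<std::string>& pending) {
  if (requested.count(ActivityKind::GpuUserAnnotation) == 0) {
    return {};  // client did not opt in, so emit nothing extra
  }
  return pending;
}
```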

Tests

Extend RunProfilerTest with optional parameters (userCorrelationId,
linkedCpuActivity, linkedActivityCallback) so existing test scaffolding
can drive a session that pushes/pops a user correlation id around the
XPU workload and resolves it back to a CPU-side activity through the
linked-activity callback. Existing tests are unaffected (defaults are
0 / nullptr).

Add XpuptiProfilerTest.GpuUserAnnotation, which enables
GPU_USER_ANNOTATION alongside the existing GPU/runtime activities,
runs the XPU compute helper inside a user correlation range, and
asserts that exactly one synthesized "user_function" annotation is
emitted per participating GPU stream (memcpy stream and kernel
stream).
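
The per-stream assertion can be pictured with a small counting helper like the one below. This is only a sketch: `TraceEvent` and `countAnnotationsPerStream` are hypothetical, and the string type tag is an assumption made for illustration rather than how the real test inspects activities.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical flattened view of a logged trace event.
struct TraceEvent {
  std::string type;  // e.g. "kernel", "gpu_memcpy", "gpu_user_annotation"
  std::string name;
  int64_t stream;
};

// Counts annotation events with the given name, grouped by stream.
std::map<int64_t, int> countAnnotationsPerStream(
    const std::vector<TraceEvent>& events, const std::string& name) {
  std::map<int64_t, int> counts;
  for (const auto& e : events) {
    if (e.type == "gpu_user_annotation" && e.name == name) {
      ++counts[e.stream];
    }
  }
  return counts;
}
```

A test in this style would then expect two entries (memcpy stream and kernel stream), each with a count of exactly one.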

Add XpuptiProfilerTest.GpuUserAnnotationLinkedActivityRetry, which
uses a callback that returns nullptr on the first lookup and the
linked CPU activity thereafter, and verifies that the plugin still
emits one annotation per stream -- pinning down the
retry-until-resolved contract on the linked-activity callback.
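
A callback with that shape could look roughly like this. `CpuActivity`, `LinkedActivityCallback`, and `makeNullFirstCallback` are hypothetical names, and the `int32_t` argument type is an assumption for the sketch rather than the real callback signature.

```cpp
#include <cstdint>
#include <functional>
#include <memory>

// Hypothetical stand-in for the linked CPU-side activity.
struct CpuActivity {
  const char* name;
};
using LinkedActivityCallback = std::function<const CpuActivity*(int32_t)>;

// Builds a callback that fails the first lookup and succeeds afterwards,
// which simulates the CPU side not being visible yet.
LinkedActivityCallback makeNullFirstCallback(const CpuActivity* linked) {
  auto calls = std::make_shared<int>(0);  // shared so copies keep one counter
  return [calls, linked](int32_t /*correlationId*/) -> const CpuActivity* {
    return (*calls)++ == 0 ? nullptr : linked;
  };
}
```

Keeping the counter in a `shared_ptr` means the behavior is the same even if the profiler copies the `std::function` before invoking it.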

meta-cla Bot commented May 8, 2026

Hi @SlawomirLaba!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

scotts assigned and then unassigned scotts on May 11, 2026

scotts (Contributor) commented May 11, 2026

cc: @gujinghui, @chuanqi129

meta-cla Bot added the "cla signed" label on May 13, 2026
SlawomirLaba force-pushed the dev/slabax/gpu_user_annotation branch from fd1f42c to 11e3666 on May 14, 2026 12:27
SlawomirLaba force-pushed the dev/slabax/gpu_user_annotation branch from 11e3666 to 027183c on May 14, 2026 12:37
SlawomirLaba changed the title from "[xpupti] Emit GPU_USER_ANNOTATION events from user correlation map" to "[xpupti]: emit GPU_USER_ANNOTATION for user record_function ranges" on May 14, 2026
SlawomirLaba marked this pull request as ready for review on May 14, 2026 12:41