question about tmem barrier

I have a question about the comment here:

https://github.com/deepseek-ai/DeepGEMM/blob/54e22612409371d6364144b69086735beb54e98b/deep_gemm/include/deep_gemm/impls/sm100_fp8_fp4_mega_moe.cuh#L269

Since TMEM is consumed by both CTAs, the next line uses 2 * kNumEpilogueThreads. Shouldn't this barrier therefore use "arrive at all CTAs" instead?

Conversely, for tmem_full_barrier above, it seems that only the leader CTA calls arrive. Is my understanding correct?
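
One way to picture the counting behind this question is the small host-side C++ model below. It is only a sketch, not DeepGEMM code: `ToyBarrier`, `kNumCTAs = 2` and `kNumEpilogueThreads = 128` are hypothetical names and values chosen for illustration, and the model assumes a barrier phase flips once the expected number of arrivals has been recorded.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy model of an arrival-count barrier (not DeepGEMM code):
// the phase flips once `expected` arrivals have been recorded.
struct ToyBarrier {
    uint32_t expected;
    uint32_t pending;
    uint32_t phase = 0;

    explicit ToyBarrier(uint32_t count) : expected(count), pending(count) {}

    void arrive() {
        if (--pending == 0) {
            pending = expected;  // re-arm for the next phase
            phase ^= 1;          // signal completion of this phase
        }
    }
};

constexpr uint32_t kNumCTAs = 2;               // assumed cluster size
constexpr uint32_t kNumEpilogueThreads = 128;  // assumed value, illustration only

// Epilogue threads of the first `num_arriving_ctas` CTAs arrive on a single
// barrier whose expected count is 2 * kNumEpilogueThreads.
inline uint32_t phase_after_arrivals(uint32_t num_arriving_ctas) {
    ToyBarrier tmem_empty(kNumCTAs * kNumEpilogueThreads);
    for (uint32_t cta = 0; cta < num_arriving_ctas; ++cta)
        for (uint32_t t = 0; t < kNumEpilogueThreads; ++t)
            tmem_empty.arrive();
    return tmem_empty.phase;
}
```

In this model, a barrier initialized with 2 * kNumEpilogueThreads only completes a phase if the epilogue threads of both CTAs arrive on that same barrier instance; arrivals from one CTA alone leave it pending. Whether the real kernel routes the arrivals this way is exactly what the question asks.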

question about tmem barrier #368
