I have a question about the comment here:
// Arrive only at the leader CTA
Since TMEM is consumed by both CTAs, the next line uses 2 * kNumEpilogueThreads. Shouldn't this barrier therefore be arrived at on all CTAs, not only on the leader CTA?
Conversely, for tmem_full_barrier above, it seems that only the leader CTA calls arrive. Is my understanding correct?
Code reference: DeepGEMM/deep_gemm/include/deep_gemm/impls/sm100_fp8_fp4_mega_moe.cuh, line 269 in 54e2261