Fix: AICore Phase 1 stale L1 cache read on aicpu_ready (#595)
yanghaoran29 wants to merge 1 commit into hw-native-sys:main
Conversation
Code Review
This pull request introduces L1 cache invalidation at the entry point of the aicore_execute function across multiple runtime implementations. This change is designed to prevent stale cache data from previous kernel runs from incorrectly satisfying the AICPU initialization handshake. The review feedback consistently suggests using dci (Invalidate) instead of dcci (Clean and Invalidate) to eliminate the risk of stale data being written back to memory and potentially overwriting initialization values from the Host or AICPU.
Force-pushed from 834ba99 to 7e039f8
Without rtDeviceReset between runs (removed in 8be358b), the first load of aicpu_ready in aicore_execute()'s Phase 1 spin can L1-hit a residual 1 from a prior run. The while-loop body's dcci never runs, AICore proceeds before AICPU finishes init, and downstream task dispatch scrambles -- producing ~5% precision failures on paged_attention Case1 -> small1 (max_diff ~ 0.3).

Fix by invalidating our Handshake L1 cache line at AICore kernel exit, right after the existing flush (CACHELINE_OUT). Placing the invalidate at the writer's exit (instead of the reader's entry) keeps state reset self-contained within the case that produced it, so the next case's first load of aicpu_ready is guaranteed to miss L1 and read HBM.

Applied identically to all five AICore executors (a2a3 host_build_graph / aicpu_build_graph / tensormap_and_ringbuffer; a5 host_build_graph / tensormap_and_ringbuffer).

Validated: 100/100 PASS on host_build_graph paged_attention (was ~5% fail).
Force-pushed from 7e039f8 to 54ffc7e
## Bug Fix: AICore Phase 1 Stale L1 Cache Read

### Symptom

When running the paged_attention cases back-to-back in a single process (Case1 -> small1), roughly 5% of runs fail precision checks with max_diff ~ 0.3.
### Relevant recent change

The only recent change to this path is commit 8be358b, which removed the rtDeviceReset call between runs.

### Root cause

In aicore_execute(), each core takes a pointer to its Handshake slot and spins on it in Phase 1:

```cpp
__gm__ Handshake *my_hank = &runtime->workers[block_idx];
// Phase 1: Wait for AICPU initialization signal
while (my_hank->aicpu_ready == 0) {
    dcci(my_hank, SINGLE_CACHE_LINE);
}
```

The execution order of this loop is: load aicpu_ready, test it, and only if the test keeps the loop alive run the dcci. The invalidate therefore never executes when the very first load already returns nonzero.

Because commit 8be358b removed rtDeviceReset between runs, L1 contents now survive from one case to the next. If a prior case left aicpu_ready == 1 resident in L1, the first load hits that stale line, the loop exits immediately, and AICore proceeds before AICPU has finished initialization.

Direct evidence: a snapshot probe placed at AICore kernel entry (before any dcci runs) captured the residual aicpu_ready == 1.

### Why single-case runs are fine but "load-once + multi-case" isn't

This is the same root cause showing up differently under two execution modes, and it hinges on one fact: AICore L1 cache lifetime is tied to the loaded binary, not to an individual case.

#### Mode A: one case per process
#### Mode B: load once, run multiple cases sequentially (the failure path)
#### Side-by-side
In other words: the two modes differ only in whether a prior case's L1 contents can still be resident when Phase 1 runs.

### Why "binaries don't unload" is a precondition for cross-case residue

Ascend uses a three-part …

The per-case lifecycle looks like this: …

Consequences: …
In short, multiple cases sharing a single already-loaded binary is one of the necessary preconditions that allows L1 residue to survive across cases.

### Fix

Add one invalidate at AICore kernel exit, immediately after the existing flush:

```cpp
// ... execution done, about to exit ...
// Flush all dirty cache lines to HBM before kernel exit.
dcci(my_hank, SINGLE_CACHE_LINE, CACHELINE_OUT);
// Invalidate our Handshake L1 line on exit so the next case on this core
// sees a fresh aicpu_ready=0 on its first load instead of an L1-resident 1
// left over from this case (no rtDeviceReset between cases).
dcci(my_hank, SINGLE_CACHE_LINE);
```

Two back-to-back dcci calls: the first (with CACHELINE_OUT) writes dirty data out to HBM; the second drops the Handshake line from L1, so the next case's first load must go to HBM.
Why append at exit instead of prepend at entry: placing the invalidate at the writer's exit keeps the state reset self-contained within the case that produced it, so the next case's first load of aicpu_ready is guaranteed to miss L1 and read HBM.

### Files changed

The same Phase 1 pattern exists in all five AICore executors. Each gets the same patch (4 lines appended after the existing exit flush: 3 comment lines + 1 invalidate):

- a2a3: host_build_graph, aicpu_build_graph, tensormap_and_ringbuffer
- a5: host_build_graph, tensormap_and_ringbuffer
### Validation

100/100 PASS across two independent runs.

### Scope and follow-ups