compute-sanitizer reports invalid shared memory read in warpspeed DeviceScanKernel on Blackwell (sm_120a) #8288

@davidwendt

Description

Type of Bug

Something else

Component

CUB

Describe the bug

Running compute-sanitizer --tool memcheck against a program that calls cub::DeviceScan::ExclusiveSum (or thrust::exclusive_scan) on a Blackwell GPU (sm_120a) produces the following error:

  ========= Invalid __shared__ read of size 16 bytes
  =========     at cp_async_bulk_cp_mask in cp_async_bulk.h:236
  =========     by thread (128,0,0) in block (0,0,0)
  =========     Address 0x400 is out of bounds
  =========     Saved host backtrace up to driver entry point at kernel launch time
  =========   ...
  =========   in function decode_page_headers ...
  =========
  ========= ERROR SUMMARY: 1 error

A fuller stack trace is available in this PR: rapidsai/cudf#21999

The reported address is 0x400 (byte offset 1024). The warpspeed kernel's dynamic shared memory allocation is 8064 bytes (smem_for_stages with policy {use_warpspeed=true, num_total_threads=352, items_per_thread=63, tile_size=8064}), so a 16-byte read at that address (bytes 1024 through 1039) should be in bounds. The program also produces correct results.
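The in-bounds claim above can be checked with a few lines of host-side C++. This is only a sketch of the arithmetic: it assumes the address printed by compute-sanitizer is a byte offset from the start of the block's shared memory and that the 8064-byte dynamic allocation begins at offset 0 (no static shared memory placed before it). The helper name in_bounds is hypothetical and is not part of CUB.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUB API): true when the byte range
// [offset, offset + size) lies entirely inside [0, alloc_bytes).
constexpr bool in_bounds(std::size_t offset, std::size_t size, std::size_t alloc_bytes)
{
  // Written to avoid overflow in offset + size.
  return size <= alloc_bytes && offset <= alloc_bytes - size;
}

// Values taken from the sanitizer report and the issue text.
constexpr std::size_t reported_offset = 0x400; // "Address 0x400"
constexpr std::size_t access_size     = 16;    // "read of size 16 bytes"
constexpr std::size_t dynamic_smem    = 8064;  // dynamic shared memory size in bytes

// Assumption: the allocation starts at shared-memory offset 0.
static_assert(reported_offset == 1024, "0x400 is byte offset 1024");
static_assert(in_bounds(reported_offset, access_size, dynamic_smem),
              "a 16-byte read at 0x400 fits inside 8064 bytes");
```

Under those assumptions the last byte touched is offset 1039, well below 8064, which is why the report looks like a false positive or a mismatch between the size the sanitizer tracks and the size the kernel was launched with.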

The error originates inside cub::detail::warpspeed::squadStoreBulkSync<int>, reached through the following frames: cub::detail::scan::kernelBody, device_scan_lookahead_body, squadDispatch, DeviceScanKernel.

How to Reproduce

The error occurs in a libcudf example run with compute-sanitizer and can be seen here: https://github.com/rapidsai/cudf/actions/runs/23918157779/job/69767002847?pr=21999#step:12:835
The PR rapidsai/cudf#21999 also contains additional information about reproducing the issue.

Unfortunately, we have not been able to create a standalone reproducer.

Expected behavior

compute-sanitizer should pass with no errors.

Reproduction link

No response

Operating System

No response

nvidia-smi output

Fri Apr  3 17:11:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:3D:00.0 Off |                    0 |
| N/A   32C    P8             31W /  600W |       0MiB /  97887MiB |      0%      Default |
|                                         |                        |             Disabled |

NVCC version

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_Dec_16_07:23:41_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.115
Build cuda_13.1.r13.1/compiler.37061995_0

$ compute-sanitizer --version
NVIDIA (R) Compute Sanitizer
Copyright (c) 2020-2025 NVIDIA Corporation
Version 2025.4.1.0 (build 37093031) (public-release)

Also fails with:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Mon_Mar_02_09:52:23_PM_PST_2026
Cuda compilation tools, release 13.2, V13.2.51
Build cuda_13.2.r13.2/compiler.37434383_0

$ compute-sanitizer --version
NVIDIA (R) Compute Sanitizer
Copyright (c) 2020-2026 NVIDIA Corporation
Version 2026.1.0.0 (build 37182542) (public-release)
