33 commits
7f23c75
remove fixed gpu id & numa id assignment
WenqingLan1 Dec 18, 2025
d63fe8c
use 128bit alignment, add float support, cleanup
WenqingLan1 Dec 18, 2025
242714e
add data_type arg
WenqingLan1 Dec 19, 2025
e8d0282
fix lint
WenqingLan1 Dec 19, 2025
5a18946
fix clang lint
WenqingLan1 Dec 19, 2025
fddf56e
update doc
WenqingLan1 Dec 20, 2025
3c359a3
Merge branch 'main' into wenqinglan/refine-gpu-stream
WenqingLan1 Dec 22, 2025
e445363
Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream
WenqingLan1 Feb 3, 2026
60b130c
Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream
WenqingLan1 Feb 6, 2026
f31933f
fix alloc count & comment
WenqingLan1 Feb 6, 2026
d8a91ab
fix: reset gpu-burn submodule to correct commit
WenqingLan1 Feb 6, 2026
2dfa122
Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream
WenqingLan1 Apr 8, 2026
6dfdaa6
resolve comments
WenqingLan1 Apr 9, 2026
e3232f5
fix lint
WenqingLan1 Apr 9, 2026
58fead3
resolve comment
WenqingLan1 Apr 9, 2026
620a9fa
Merge remote-tracking branch 'origin/main' into wenqinglan/refine-gpu…
WenqingLan1 Apr 22, 2026
5cec42c
Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream
WenqingLan1 May 13, 2026
8c51d2f
Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream
WenqingLan1 May 20, 2026
01c7454
resolve comments
WenqingLan1 May 20, 2026
450a28d
refine doc
WenqingLan1 May 20, 2026
9cffad8
fix lint
WenqingLan1 May 21, 2026
fe00d1a
resolve comment
WenqingLan1 May 21, 2026
ea3fd8e
fix cuda11.1 build
WenqingLan1 May 21, 2026
2b6ea7e
fix doc
WenqingLan1 May 21, 2026
0fd405c
resolve comments
WenqingLan1 May 21, 2026
4173759
fix syntax
WenqingLan1 May 21, 2026
ee52086
fix lint
WenqingLan1 May 21, 2026
80d5f0a
resolve comment
WenqingLan1 May 21, 2026
ff9e254
Merge branch 'main' into wenqinglan/refine-gpu-stream
polarG May 22, 2026
5e887ff
fix nvmldevicegetnumanodeid error
WenqingLan1 May 22, 2026
18531e0
fix nvml call indx
WenqingLan1 May 22, 2026
b21284b
fix lint
WenqingLan1 May 22, 2026
a7f83d4
fix shadow ret
WenqingLan1 May 22, 2026
20 changes: 11 additions & 9 deletions docs/user-tutorial/benchmarks/micro-benchmarks.md
@@ -267,20 +267,22 @@ For measurements of peer-to-peer communication performance between AMD GPUs, GPU

 #### Introduction

-Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double datatype.
+Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double and float datatypes.

+__Note__: When `--check_data` is enabled, each process allocates 6× `--size` bytes of host memory (data\_buf + check\_buf + 4 validation buffers, e.g. 24 GiB with the default 4 GiB `--size`). Under `default_local_mode` with 8 GPUs this totals ~192 GiB of host RAM. Recommend using a small `--size` such as `1048576` (1 MiB) when `--check_data` is enabled.
+
 #### Metrics

 | Metric Name | Unit | Description |
 |------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
-| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the copy operation with specified buffer size and block size. |
-| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the scale operation with specified buffer size and block size. |
-| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
-| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
-| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. |
-| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. |
-| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size. |
-| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size. |
+| STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the copy operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+| STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the scale operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+| STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the add operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+| STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the triad operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+| STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the copy operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+| STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the scale operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+| STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the add operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+| STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the triad operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |

 ### `ib-loopback`

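Note on the renamed metrics: the updated table drops the `gpu_[0-9]` segment and adds a `(double|float)` segment, so anything that parses metric names by pattern may need adjusting. A minimal sketch of matching the new names, assuming the pattern in the table is the complete name (the helper `parse_stream_metric` and the sample names below are hypothetical, not part of this PR):

```python
import re

# Pattern transcribed from the updated docs table (markdown escapes removed).
# Assumption: the documented pattern covers the whole metric name.
METRIC_PATTERN = re.compile(
    r'^STREAM_(COPY|SCALE|ADD|TRIAD)_(double|float)'
    r'_buffer_([0-9]+)_block_([0-9]+)_(bw|ratio)$'
)


def parse_stream_metric(name):
    """Split a metric name into its fields, or return None if it does not match."""
    match = METRIC_PATTERN.match(name)
    if match is None:
        return None
    operation, data_type, buffer_size, block_size, kind = match.groups()
    return {
        'operation': operation,
        'data_type': data_type,
        'buffer_size': int(buffer_size),
        'block_size': int(block_size),
        'kind': kind,
    }


# Made-up sample names, for illustration only.
new_style = parse_stream_metric('STREAM_COPY_float_buffer_1048576_block_256_bw')
old_style = parse_stream_metric('STREAM_COPY_double_gpu_0_buffer_1048576_block_256_bw')
```

With this pattern, names in the old `..._double_gpu_N_...` form no longer match, which is the behaviour change a downstream rule file would need to account for.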
2 changes: 1 addition & 1 deletion examples/benchmarks/gpu_stream.py
@@ -12,7 +12,7 @@

 if __name__ == '__main__':
     context = BenchmarkRegistry.create_benchmark_context(
-        'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10'
+        'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10 --data_type double'
     )
     # For ROCm environment, please specify the benchmark name and the platform as the following.
     # context = BenchmarkRegistry.create_benchmark_context(
17 changes: 14 additions & 3 deletions superbench/benchmarks/micro_benchmarks/gpu_stream.py
@@ -51,10 +51,21 @@ def add_parser_arguments(self):
             help='Number of data buffer copies performed.',
         )

+        self._parser.add_argument(
+            '--data_type',
+            type=str,
+            default='double',
+            choices=['float', 'double'],
+            required=False,
+            help='Data type of the buffer elements.',
+        )
+
         self._parser.add_argument(
             '--check_data',
             action='store_true',
-            help='Enable data checking',
+            help='Enable data checking. Note: allocates 6x --size bytes of host memory per process '
+            '(data_buf + check_buf + 4 validation buffers, e.g. 24 GiB with default 4 GiB --size). '
+            'Recommend using a small --size such as 1048576 (1 MiB) when this flag is enabled.',
         )

     def _preprocess(self):
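Note on the `--check_data` help text: the figures it quotes (6x `--size` per process, 24 GiB at the default 4 GiB, about 192 GiB for 8 processes) can be reproduced with the arithmetic below. This is only a sketch of the stated numbers; the function name `check_data_host_bytes` is hypothetical, and the factor of 6 is taken from the help text rather than measured.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def check_data_host_bytes(size_bytes, num_processes=1, buffers_per_process=6):
    """Estimate host memory used when --check_data is on.

    buffers_per_process=6 follows the help text:
    data_buf + check_buf + 4 validation buffers, each of --size bytes.
    """
    return size_bytes * buffers_per_process * num_processes


per_process = check_data_host_bytes(4 * GIB)                 # default 4 GiB --size
eight_gpus = check_data_host_bytes(4 * GIB, num_processes=8)  # one process per GPU
small_run = check_data_host_bytes(1048576, num_processes=8)   # recommended 1 MiB --size

print(per_process // GIB, 'GiB per process')
print(eight_gpus // GIB, 'GiB for 8 processes')
print(small_run // MIB, 'MiB for 8 processes at 1 MiB --size')
```

At the recommended 1 MiB `--size`, the same eight-process run needs on the order of tens of MiB instead of hundreds of GiB.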
@@ -68,8 +79,8 @@ def _preprocess(self):

         self.__bin_path = os.path.join(self._args.bin_dir, self._bin_name)

-        args = '--size %d --num_warm_up %d --num_loops %d ' % (
-            self._args.size, self._args.num_warm_up, self._args.num_loops
+        args = '--size %d --num_warm_up %d --num_loops %d --data_type %s' % (
+            self._args.size, self._args.num_warm_up, self._args.num_loops, self._args.data_type
         )

Comment thread
WenqingLan1 marked this conversation as resolved.
         if self._args.check_data:
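Note on the argument string: the new format string no longer ends with a trailing space, so whatever is appended for `--check_data` has to supply its own separator. A standalone sketch of the parsing and string-building pattern follows. It uses plain `argparse` rather than the SuperBench base class; the defaults for `--num_warm_up` and `--num_loops` and the leading space in `' --check_data'` are assumptions made for illustration, not taken from this diff.

```python
import argparse


def build_parser():
    """Stand-in parser mirroring the options visible in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--size', type=int, default=4 * 1024 ** 3)  # 4 GiB default per the help text
    parser.add_argument('--num_warm_up', type=int, default=1)   # assumed default
    parser.add_argument('--num_loops', type=int, default=10)    # assumed default
    parser.add_argument('--data_type', type=str, default='double', choices=['float', 'double'])
    parser.add_argument('--check_data', action='store_true')
    return parser


def build_args(ns):
    """Build the argument string passed to the gpu_stream binary."""
    args = '--size %d --num_warm_up %d --num_loops %d --data_type %s' % (
        ns.size, ns.num_warm_up, ns.num_loops, ns.data_type
    )
    if ns.check_data:
        # Assumption: a leading space is needed because the base string has no trailing one.
        args += ' --check_data'
    return args


parser = build_parser()
float_cmd = build_args(parser.parse_args(['--size', '1048576', '--data_type', 'float']))
checked_cmd = build_args(parser.parse_args(['--size', '1048576', '--check_data']))
```

Because `choices` is set, a value such as `--data_type half` is rejected by `argparse` at parse time instead of reaching the binary.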
@@ -29,7 +29,7 @@ message(STATUS "Found CUDA: " ${CUDAToolkit_VERSION})

 # Source files
 set(SOURCES
-    gpu_stream_test.cpp
+    gpu_stream_main.cpp
     gpu_stream_utils.cpp
     gpu_stream.cu
     gpu_stream_kernels.cu
@@ -38,7 +38,8 @@ set(SOURCES
 include(../cuda_common.cmake)
 add_executable(gpu_stream ${SOURCES})
 set_property(TARGET gpu_stream PROPERTY CUDA_ARCHITECTURES ${NVCC_ARCHS_SUPPORTED})
+target_compile_definitions(gpu_stream PRIVATE _GNU_SOURCE)
 target_include_directories(gpu_stream PRIVATE ${CUDAToolkit_INCLUDE_DIRS})
-target_link_libraries(gpu_stream numa ${NVML_LIBRARY})
+target_link_libraries(gpu_stream numa ${NVML_LIBRARY} ${CMAKE_DL_LIBS})

 install(TARGETS gpu_stream RUNTIME DESTINATION bin)