Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream #769

Open
WenqingLan1 wants to merge 22 commits into microsoft:main from WenqingLan1:wenqinglan/refine-gpu-stream

Conversation

Contributor

@WenqingLan1 WenqingLan1 commented Dec 19, 2025

Refinements:

  • Add support for float (fp32) execution and a --data_type <float|double> CLI option for runtime type selection.
  • Refactor CUDA kernels to use 128-bit vectorized accesses (double2 / float4) and move the template kernel implementations into a header so that they can be instantiated across translation units (TUs).
  • Fix the buffer allocation size bug: args->size is the buffer size in bytes, not the number of elements.
  • Run and report on a single visible GPU (device 0, selected via CUDA_VISIBLE_DEVICES), and update the metric/tag formats (removing gpu_id) along with the docs, examples, and test log.
  • Switch NUMA allocation from the hard-coded numa_alloc_onnode to numa_alloc_local so that host buffers are allocated on the local node.
  • Rename entry point file from gpu_stream_test.cpp to gpu_stream_main.cpp.
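
To make the size semantics in the list above concrete, here is a minimal Python sketch (not the benchmark's C++/CUDA code) of how a buffer size given in bytes maps to element counts and 128-bit vector counts. The helper name `buffer_layout` and the multiple-of-16 check are illustrative assumptions, not code from this PR.

```python
# Illustrative only: hypothetical helper, not part of SuperBench.
ELEMENT_BYTES = {"float": 4, "double": 8}  # sizeof(float), sizeof(double)
VECTOR_BYTES = 16  # one 128-bit access: float4 or double2


def buffer_layout(size_bytes: int, data_type: str) -> dict:
    """Return element and vector counts for a buffer whose size is in bytes."""
    if data_type not in ELEMENT_BYTES:
        raise ValueError(f"unsupported data type: {data_type}")
    if size_bytes <= 0 or size_bytes % VECTOR_BYTES != 0:
        # Assumption: vectorized kernels need a whole number of 16-byte vectors.
        raise ValueError("size must be a positive multiple of 16 bytes")
    elem = ELEMENT_BYTES[data_type]
    return {
        "elements": size_bytes // elem,          # NOT equal to size_bytes
        "vectors": size_bytes // VECTOR_BYTES,   # independent of the data type
        "elements_per_vector": VECTOR_BYTES // elem,
    }


print(buffer_layout(1048576, "float"))
print(buffer_layout(1048576, "double"))
```

Treating the byte count as an element count would over-allocate by a factor of 4 (float) or 8 (double), which is the kind of mismatch the bug-fix bullet describes.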

Note: the metric tag no longer includes gpu_idx and execution is now per-process, so users need to update their configs and rules.
New config:

    gpu-stream:fp64:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 1308622848
        data_type: double
    gpu-stream:fp64-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: double
        check_data: true
    gpu-stream:fp32:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 654311424
        data_type: float
    gpu-stream:fp32-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: float
        check_data: true

New rule:

    gpu-stream:
      statistics:
        - mean
      categories: GPU-STREAM
      aggregate: True
      metrics:
        - gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)

Example results:

"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:1": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:2": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3": 1234

Processed by rules:

| gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw | mean | 1234|
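
As a rough illustration of the rule above, the following Python sketch applies the same metric regex to the example results and averages the per-rank values. It only approximates the behaviour described here: the function `aggregate_mean` is hypothetical, and the assumption that aggregation drops the captured `:<rank>` suffix is mine, not taken from SuperBench's rule engine.

```python
import re
from collections import defaultdict
from statistics import mean

# Metric pattern copied from the rule; the capture group is the rank suffix.
PATTERN = re.compile(r"gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)")

results = {
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0": 1234,
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:1": 1234,
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:2": 1234,
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3": 1234,
}


def aggregate_mean(raw: dict) -> dict:
    """Group matching metrics by name without the rank suffix and average them."""
    groups = defaultdict(list)
    for name, value in raw.items():
        match = PATTERN.fullmatch(name)
        if match is None:
            continue  # metric not covered by this rule
        # Assumption: aggregation strips ":<rank>", i.e. the captured group
        # plus the colon in front of it.
        key = name[: match.start(1) - 1]
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


for key, value in aggregate_mean(results).items():
    print(f"| {key} | mean | {value} |")
```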

@WenqingLan1 WenqingLan1 requested a review from a team as a code owner December 19, 2025 20:05
@WenqingLan1 WenqingLan1 added the micro-benchmarks Micro Benchmark Test for SuperBench Benchmarks label Dec 19, 2025

codecov Bot commented Dec 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.69%. Comparing base (932d9f6) to head (fe00d1a).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #769   +/-   ##
=======================================
  Coverage   85.69%   85.69%           
=======================================
  Files         103      103           
  Lines        7890     7891    +1     
=======================================
+ Hits         6761     6762    +1     
  Misses       1129     1129           
| Flag | Coverage Δ |
| --- | --- |
| cpu-python3.10-unit-test | 70.42% <50.00%> (+<0.01%) ⬆️ |
| cpu-python3.12-unit-test | 70.42% <50.00%> (+<0.01%) ⬆️ |
| cpu-python3.7-unit-test | 69.85% <50.00%> (+<0.01%) ⬆️ |
| cuda-unit-test | 83.60% <100.00%> (+<0.01%) ⬆️ |

@guoshzhao guoshzhao self-assigned this Dec 19, 2025
Copilot AI review requested due to automatic review settings February 3, 2026 22:14
Contributor

Copilot AI left a comment

Pull request overview

Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning SuperBench integration (CLI, output tags, docs, and tests) to the new behavior.

Changes:

  • Add --data_type <float|double> to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper + unit tests.
  • Refactor CUDA kernels to use 128-bit vectorized accesses (double2 / float4) and move template kernel implementations into a header for cross-TU instantiation.
  • Adjust execution/output to single visible GPU (device 0 via CUDA_VISIBLE_DEVICES) and update metric/tag formats (removing gpu_id) plus docs/examples/test log.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/data/gpu_stream.log | Updates golden log output to include data type and new tag format (no gpu_id). |
| tests/benchmarks/micro_benchmarks/test_gpu_stream.py | Extends command-generation assertions to include --data_type (currently only covers double). |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp | Removes NUMA/GPU iteration fields from args and adds Opts::data_type. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp | Adds CLI parsing/printing for --data_type. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp | New entry point replacing the previous main file. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp | Introduces vector-type mapping and templated kernel definitions (128-bit loads/stores). |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu | Keeps a CUDA compilation unit and moves template implementations to the header. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp | Expands bench-args variant to support float and double. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu | Uses local NUMA allocation, enforces 16B/thread sizing, launches templated vectorized kernels, updates tag format, and runs only CUDA device 0. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt | Switches target sources to the new gpu_stream_main.cpp. |
| superbench/benchmarks/micro_benchmarks/gpu_stream.py | Adds --data_type argument and forwards it to the binary. |
| examples/benchmarks/gpu_stream.py | Updates example invocation to include --data_type double. |
| docs/user-tutorial/benchmarks/micro-benchmarks.md | Updates gpu-stream metric patterns to include `(double |

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread docs/user-tutorial/benchmarks/micro-benchmarks.md Outdated
Copilot AI review requested due to automatic review settings February 6, 2026 00:20
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 14 changed files in this pull request and generated 2 comments.


Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
@guoshzhao guoshzhao requested a review from abuccts February 13, 2026 00:11
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp Outdated
Copilot AI review requested due to automatic review settings April 8, 2026 20:27
Copilot AI review requested due to automatic review settings April 9, 2026 21:59
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99

  • ParseOpts initializes size_specified=true, which makes --size effectively optional, but PrintUsage presents --size as required and the end-of-parse validation still checks size_specified. Either initialize size_specified=false to enforce explicit --size, or update the usage/validation logic to reflect that the default buffer size is acceptable.
    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
                parse_err = true;
            }
            break;
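
The effect described in this comment can be reproduced with a small Python model of the flag logic. This is a sketch only: `parse_has_error` is a hypothetical stand-in for the end-of-parse check in ParseOpts, it ignores option values, and it is not the project's C++ code.

```python
def parse_has_error(argv, size_specified_init):
    """Model the required-argument check; True means a required flag is missing."""
    size_specified = size_specified_init
    num_warm_up_specified = False
    num_loops_specified = False
    for arg in argv:
        if arg == "--size":
            size_specified = True
        elif arg == "--num_warm_up":
            num_warm_up_specified = True
        elif arg == "--num_loops":
            num_loops_specified = True
    return not (size_specified and num_warm_up_specified and num_loops_specified)


# Command line that omits --size entirely.
argv = ["--num_warm_up", "10", "--num_loops", "40"]

# Initialised to True: the missing --size goes unnoticed.
print(parse_has_error(argv, size_specified_init=True))
# Initialised to False: the check reports the missing --size.
print(parse_has_error(argv, size_specified_init=False))
```

Either outcome can be the intended one; the point is that the initial value, the usage text, and the final check should agree on whether --size is required.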

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream.py
@microsoft microsoft deleted a comment from Copilot AI Apr 9, 2026
…-stream

# Conflicts:
#	tests/benchmarks/micro_benchmarks/test_gpu_stream.py
Copilot AI review requested due to automatic review settings April 22, 2026 18:10
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99

  • ParseOpts sets size_specified to true initially, which makes the required-argument validation (if (!size_specified || ...)) ineffective for --size. Either initialize size_specified to false (to truly require --size) or remove size_specified from the required check if --size is intended to be optional via the default.
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
                parse_err = true;
            }
            break;

polarG

This comment was marked as duplicate.

@polarG polarG dismissed their stale review May 13, 2026 22:56

Superseded by updated review (model attribution removed).

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread tests/data/gpu_stream.log
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Copilot AI review requested due to automatic review settings May 20, 2026 18:48
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:96

  • In ParseOpts, size_specified is initialized to true and only ever set to true, so the final required-args check if (!size_specified || !num_warm_up_specified || !num_loops_specified) can never fail due to a missing --size. This is inconsistent with the usage string (which shows --size as required) and makes the variable effectively dead/misleading. Initialize size_specified to false (and set it true only when --size is parsed) if --size is required, or remove size_specified from the required-args validation (and update the usage message) if the default size is intended to be allowed.
    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {

Comment on lines 4 to +12
#pragma once

#include <cuda.h>
#include <cuda_runtime.h>

#include "gpu_stream_utils.hpp"
constexpr auto kNumLoopUnrollAlias = stream_config::kNumLoopUnroll;

// Function declarations
template <typename T> inline __device__ void Fetch(T &v, const T *p);
template <typename T> inline __device__ void Store(T *p, const T &v);
/**
* @brief Type trait mapping scalar types to their 128-bit aligned vector types.
Comment on lines 660 to 666
// Pin the thread to its local NUMA node to prevent migration,
// ensuring numa_alloc_local buffers remain node-local.
int cpu = sched_getcpu();
if (cpu < 0) {
    std::cerr << "Run::sched_getcpu failed" << std::endl;
    return -1;
}
Copilot AI review requested due to automatic review settings May 20, 2026 20:20
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99

  • size_specified is initialized to true, which makes the !size_specified check at end-of-parse ineffective and inconsistent with the usage text that treats --size as required. Consider initializing it to false (and requiring --size), or removing the flag/check entirely if --size is meant to be optional due to the default.
    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
                parse_err = true;
            }
            break;

Comment on lines +24 to +29
// Kernel declarations (visible to all compilers for function pointer usage)
template <typename T> __global__ void CopyKernel(VecT<T> *tgt, const VecT<T> *src);
template <typename T> __global__ void ScaleKernel(VecT<T> *tgt, const VecT<T> *src, const T scalar);
template <typename T> __global__ void AddKernel(VecT<T> *tgt, const VecT<T> *src_a, const VecT<T> *src_b);
template <typename T>
__global__ void TriadKernel(VecT<T> *tgt, const VecT<T> *src_a, const VecT<T> *src_b, const T scalar);
Comment on lines +48 to +55
template <typename T> inline __device__ void Fetch(T &v, const T *p) {
#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
    v = *p;
#else
    if constexpr (std::is_same<T, double2>::value) {
        asm volatile("ld.volatile.global.v2.f64 {%0,%1}, [%2];" : "=d"(v.x), "=d"(v.y) : "l"(p) : "memory");
    } else if constexpr (std::is_same<T, float4>::value) {
        asm volatile("ld.volatile.global.v4.f32 {%0,%1,%2,%3}, [%4];"
for (int j = 0; j < args->sub.times_in_ms[i].size(); j++) {
// STREAM_<Kernelname>_datatype_buffer_<buffer_size>_block_<block_size>
for (size_t i = 0; i < args->sub.times_in_ms.size(); i++) {
std::string tag = "STREAM_" + KernelToString(i) + "_" + data_type + "_buffer_" + std::to_string(args->size);
Comment thread tests/benchmarks/micro_benchmarks/test_gpu_stream.py
Copilot AI review requested due to automatic review settings May 21, 2026 00:26
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 4 comments.

Comment on lines +66 to +67
help='Enable data checking. Note: allocates 2x --size bytes of host memory per process '
'for validation buffers (e.g. 8 GiB with default 4 GiB --size). '
Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double datatype.
Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double and float datatypes.

__Note__: When `--check_data` is enabled, each process allocates 2× `--size` bytes of host memory for validation buffers (e.g. 8 GiB with the default 4 GiB `--size`). Under `default_local_mode` with 8 GPUs this totals ~64 GiB of host RAM. Recommend using a small `--size` such as `1048576` (1 MiB) when `--check_data` is enabled.
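
The arithmetic in the note can be checked with a few lines of Python. This is a sketch under the note's own assumptions (2× `--size` bytes per process, one process per GPU); the function name `check_data_host_bytes` is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def check_data_host_bytes(size_bytes: int, num_processes: int = 1) -> int:
    """Host memory used by validation buffers: 2 x size per process."""
    return 2 * size_bytes * num_processes


default_size = 4 * GIB  # default --size quoted in the note

print(check_data_host_bytes(default_size) / GIB)        # one process, in GiB
print(check_data_host_bytes(default_size, 8) / GIB)     # 8 processes, in GiB
print(check_data_host_bytes(1048576, 8) / MIB)          # 1 MiB size, 8 processes, in MiB
```

With the recommended 1 MiB `--size`, eight processes need only 16 MiB of host memory for validation, compared with 64 GiB at the default size.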
Comment on lines +151 to +155
assert (output_key.strip('_bw') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_bw')][0] == benchmark.result[output_key][0])
else:
assert (output_key.strip('_ratio') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_ratio')][1] == benchmark.result[output_key][0])
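
A general Python point about the lines quoted above: `str.strip('_bw')` removes any of the characters `_`, `b`, `w` from both ends of the string, not the literal suffix `_bw`. The sketch below contrasts it with an explicit suffix removal that also works on Python 3.7 (where `str.removesuffix` is not available). The helper `remove_suffix` and the sample key are illustrative assumptions, not code from this PR.

```python
def remove_suffix(text: str, suffix: str) -> str:
    """Remove `suffix` only if `text` ends with it (works on Python 3.7+)."""
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


# A key shaped like the benchmark tags: strip() happens to work because the
# character before "_bw" is a digit, which is not in the set {'_', 'b', 'w'}.
tag = "STREAM_COPY_double_buffer_1048576_block_256_bw"
print(tag.strip("_bw") == remove_suffix(tag, "_bw"))

# A hypothetical key where the two differ: strip() keeps eating characters.
tricky = "copy_b_bw"
print(tricky.strip("_bw"))
print(remove_suffix(tricky, "_bw"))
```

For the tag format used in this benchmark the two approaches give the same result, so the existing assertions pass; the explicit suffix removal just avoids depending on that coincidence.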
}

if (ret != 0) {
std::cerr << "Run::RunStream error: " << errno << std::endl;
Labels

micro-benchmarks Micro Benchmark Test for SuperBench Benchmarks

4 participants