Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream#769
Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream#769WenqingLan1 wants to merge 22 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #769 +/- ##
=======================================
Coverage 85.69% 85.69%
=======================================
Files 103 103
Lines 7890 7891 +1
=======================================
+ Hits 6761 6762 +1
Misses 1129 1129
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning SuperBench integration (CLI, output tags, docs, and tests) to the new behavior.
Changes:
- Add
--data_type <float|double>to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper + unit tests. - Refactor CUDA kernels to use 128-bit vectorized accesses (
double2/float4) and move template kernel implementations into a header for cross-TU instantiation. - Adjust execution/output to single visible GPU (device 0 via
CUDA_VISIBLE_DEVICES) and update metric/tag formats (removinggpu_id) plus docs/examples/test log.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
tests/data/gpu_stream.log |
Updates golden log output to include data type and new tag format (no gpu_id). |
tests/benchmarks/micro_benchmarks/test_gpu_stream.py |
Extends command-generation assertions to include --data_type (currently only covers double). |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp |
Removes NUMA/GPU iteration fields from args and adds Opts::data_type. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp |
Adds CLI parsing/printing for --data_type. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp |
New entry point replacing the previous main file. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp |
Introduces vector-type mapping and templated kernel definitions (128-bit loads/stores). |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu |
Keeps a CUDA compilation unit and moves template implementations to the header. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp |
Expands bench-args variant to support float and double. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu |
Uses local NUMA allocation, enforces 16B/thread sizing, launches templated vectorized kernels, updates tag format, and runs only CUDA device 0. |
superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt |
Switches target sources to the new gpu_stream_main.cpp. |
superbench/benchmarks/micro_benchmarks/gpu_stream.py |
Adds --data_type argument and forwards it to the binary. |
examples/benchmarks/gpu_stream.py |
Updates example invocation to include --data_type double. |
docs/user-tutorial/benchmarks/micro-benchmarks.md |
Updates gpu-stream metric patterns to include `(double |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 14 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99
- ParseOpts initializes size_specified=true, which makes --size effectively optional, but PrintUsage presents --size as required and the end-of-parse validation still checks size_specified. Either initialize size_specified=false to enforce explicit --size, or update the usage/validation logic to reflect that the default buffer size is acceptable.
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
parse_err = true;
}
break;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…-stream # Conflicts: # tests/benchmarks/micro_benchmarks/test_gpu_stream.py
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 13 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99
ParseOptssetssize_specifiedtotrueinitially, which makes the required-argument validation (if (!size_specified || ...)) ineffective for--size. Either initializesize_specifiedtofalse(to truly require--size) or removesize_specifiedfrom the required check if--sizeis intended to be optional via the default.
bool size_specified = true;
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
parse_err = true;
}
break;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Superseded by updated review (model attribution removed).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 13 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:96
- In ParseOpts,
size_specifiedis initialized totrueand only ever set totrue, so the final required-args checkif (!size_specified || !num_warm_up_specified || !num_loops_specified)can never fail due to a missing--size. This is inconsistent with the usage string (which shows--sizeas required) and makes the variable effectively dead/misleading. Initializesize_specifiedtofalse(and set ittrueonly when--sizeis parsed) if--sizeis required, or removesize_specifiedfrom the required-args validation (and update the usage message) if the default size is intended to be allowed.
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
| #pragma once | ||
|
|
||
| #include <cuda.h> | ||
| #include <cuda_runtime.h> | ||
|
|
||
| #include "gpu_stream_utils.hpp" | ||
| constexpr auto kNumLoopUnrollAlias = stream_config::kNumLoopUnroll; | ||
|
|
||
| // Function declarations | ||
| template <typename T> inline __device__ void Fetch(T &v, const T *p); | ||
| template <typename T> inline __device__ void Store(T *p, const T &v); | ||
| /** | ||
| * @brief Type trait mapping scalar types to their 128-bit aligned vector types. |
| // Pin the thread to its local NUMA node to prevent migration, | ||
| // ensuring numa_alloc_local buffers remain node-local. | ||
| int cpu = sched_getcpu(); | ||
| if (cpu < 0) { | ||
| std::cerr << "Run::sched_getcpu failed" << std::endl; | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 14 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99
size_specifiedis initialized totrue, which makes the!size_specifiedcheck at end-of-parse ineffective and inconsistent with the usage text that treats--sizeas required. Consider initializing it tofalse(and requiring--size), or removing the flag/check entirely if--sizeis meant to be optional due to the default.
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
parse_err = true;
}
break;
| // Kernel declarations (visible to all compilers for function pointer usage) | ||
| template <typename T> __global__ void CopyKernel(VecT<T> *tgt, const VecT<T> *src); | ||
| template <typename T> __global__ void ScaleKernel(VecT<T> *tgt, const VecT<T> *src, const T scalar); | ||
| template <typename T> __global__ void AddKernel(VecT<T> *tgt, const VecT<T> *src_a, const VecT<T> *src_b); | ||
| template <typename T> | ||
| __global__ void TriadKernel(VecT<T> *tgt, const VecT<T> *src_a, const VecT<T> *src_b, const T scalar); |
| template <typename T> inline __device__ void Fetch(T &v, const T *p) { | ||
| #if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__) | ||
| v = *p; | ||
| #else | ||
| if constexpr (std::is_same<T, double2>::value) { | ||
| asm volatile("ld.volatile.global.v2.f64 {%0,%1}, [%2];" : "=d"(v.x), "=d"(v.y) : "l"(p) : "memory"); | ||
| } else if constexpr (std::is_same<T, float4>::value) { | ||
| asm volatile("ld.volatile.global.v4.f32 {%0,%1,%2,%3}, [%4];" |
| for (int j = 0; j < args->sub.times_in_ms[i].size(); j++) { | ||
| // STREAM_<Kernelname>_datatype_buffer_<buffer_size>_block_<block_size> | ||
| for (size_t i = 0; i < args->sub.times_in_ms.size(); i++) { | ||
| std::string tag = "STREAM_" + KernelToString(i) + "_" + data_type + "_buffer_" + std::to_string(args->size); |
| help='Enable data checking. Note: allocates 2x --size bytes of host memory per process ' | ||
| 'for validation buffers (e.g. 8 GiB with default 4 GiB --size). ' |
| Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double datatype. | ||
| Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double and float datatypes. | ||
|
|
||
| __Note__: When `--check_data` is enabled, each process allocates 2× `--size` bytes of host memory for validation buffers (e.g. 8 GiB with the default 4 GiB `--size`). Under `default_local_mode` with 8 GPUs this totals ~64 GiB of host RAM. Recommend using a small `--size` such as `1048576` (1 MiB) when `--check_data` is enabled. |
| assert (output_key.strip('_bw') in test_raw_output_dict) | ||
| assert (test_raw_output_dict[output_key.strip('_bw')][0] == benchmark.result[output_key][0]) | ||
| else: | ||
| assert (output_key.strip('_ratio') in test_raw_output_dict) | ||
| assert (test_raw_output_dict[output_key.strip('_ratio')][1] == benchmark.result[output_key][0]) |
| } | ||
|
|
||
| if (ret != 0) { | ||
| std::cerr << "Run::RunStream error: " << errno << std::endl; |
Refinements:
Note: metric tag removed gpu_idx and the execution is per-process, so users need to update the configs & rules.
New config:
New rule:
Example results:
Processed by rules: