Add some logging to GPU detection#6992
Conversation
even though we just successfully created a context. cuMemGetInfo() doesn't get this error. So use it. Note: I'm not sure if GetInfo() works for > 4GB; my GPU is a bit less than 4GB OpenCL: for generic devices, write FLOPS info to log file
There was a problem hiding this comment.
2 issues found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="client/gpu_nvidia.cpp">
<violation number="1" location="client/gpu_nvidia.cpp:601">
P1: This forced `false` disables the only safe fallback and can null-dereference `p_cuMemGetInfo` when only `cuMemGetInfo_v2` is exported.</violation>
<violation number="2" location="client/gpu_nvidia.cpp:610">
P2: `gpu_warning()` records a persistent warning, so this new success log will mark normal CUDA RAM probes as warnings.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
There was a problem hiding this comment.
Pull request overview
This PR improves GPU detection diagnostics by adding more detailed logging around OpenCL generic device FLOPS estimation, timestamping GPU detection stderr output, and adjusting CUDA memory queries.
Changes:
- Log computed peak FLOPS for generic OpenCL devices (and warn/fallback when out of range).
- Add timestamps to GPU detection messages written via
gpu_warning(). - Switch NVIDIA available-RAM query path toward
cuMemGetInfo()(away from_v2) and add a detailed return-value log.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
client/gpu_opencl.cpp |
Adds detailed logging for generic OpenCL peak-FLOPS calculation and fallback behavior. |
client/gpu_nvidia.cpp |
Alters CUDA memory info function selection and logs cuMemGetInfo return/free/total details. |
client/gpu_detect.cpp |
Prefixes gpu_warning() stderr output with a timestamp. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| void gpu_warning(vector<string> &warnings, const char* msg) { | ||
| fprintf(stderr, "%s\n", msg); | ||
| fprintf(stderr, "[%s] %s\n", time_to_string(dtime()), msg); | ||
| warnings.push_back(msg); |
There was a problem hiding this comment.
gpu_warning() now calls time_to_string(dtime()), but this file doesn't include str_util.h where time_to_string() is declared. This will fail to compile; include str_util.h (or otherwise ensure time_to_string is declared) in this translation unit.
| // cuMemGetInfo_v2() always returns 201 (no context) | ||
| if (p_cuMemGetInfo) { | ||
| retval = (*p_cuMemGetInfo)(&memfree, &memtotal); | ||
| } | ||
| if (retval) { | ||
| snprintf(buf, sizeof(buf), | ||
| "[coproc] cuMemGetInfo(%d) returned %d", cc.device_num, retval | ||
| "cuMemGetInfo() returned %d; dev %d free %zu total %zu", | ||
| retval, cc.device_num, memfree, memtotal |
There was a problem hiding this comment.
The if (false && p_cuMemGetInfo_v2) branch disables cuMemGetInfo_v2() unconditionally. If only cuMemGetInfo_v2 is available (p_cuMemGetInfo is null), the else-branch will dereference a null function pointer and crash. Prefer selecting the available symbol (use v2 when v1 is missing), or try v2 first and fall back to v1 on CUDA_ERROR_INVALID_CONTEXT (201).
| snprintf(buf2, sizeof(buf2), | ||
| "OpenCL generic: peak FLOPS %f; Max units %u, max freq %u MHz", | ||
| prop.peak_flops, | ||
| prop.max_compute_units, prop.max_clock_frequency | ||
| ); | ||
| gpu_warning(warnings, buf2); | ||
| if (prop.peak_flops <= 0 || prop.peak_flops > GPU_MAX_PEAK_FLOPS) { | ||
| char buf2[256]; | ||
| snprintf(buf2, sizeof(buf2), | ||
| "OpenCL generic: bad peak FLOPS; Max units %u, max freq %u MHz", | ||
| prop.max_compute_units, prop.max_clock_frequency | ||
| ); | ||
| gpu_warning(warnings, buf2); | ||
| gpu_warning(warnings, "bad peak flops value; using default"); |
There was a problem hiding this comment.
This hunk introduces tab-indented lines (prop.peak_flops argument and the subsequent gpu_warning call), which are inconsistent with the surrounding space indentation and make the block harder to read/maintain. Replace the tabs with spaces to match the file’s existing indentation style.
| if (prop.peak_flops <= 0 || prop.peak_flops > GPU_MAX_PEAK_FLOPS) { | ||
| char buf2[256]; | ||
| snprintf(buf2, sizeof(buf2), | ||
| "OpenCL generic: bad peak FLOPS; Max units %u, max freq %u MHz", | ||
| prop.max_compute_units, prop.max_clock_frequency | ||
| ); | ||
| gpu_warning(warnings, buf2); | ||
| gpu_warning(warnings, "bad peak flops value; using default"); | ||
| prop.peak_flops = GPU_DEFAULT_PEAK_FLOPS; |
There was a problem hiding this comment.
The new warning message "bad peak flops value; using default" is ambiguous on its own (it doesn’t identify the device/vendor or the rejected value). Consider including at least the computed peak_flops (and/or vendor/name) so logs make sense when multiple devices are enumerated.
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="client/gpu_nvidia.cpp">
<violation number="1" location="client/gpu_nvidia.cpp:602">
P1: Handle `cuMemGetInfo*()` failures before updating available RAM; otherwise an error now turns `available_ram` into 0 instead of preserving the fallback total memory value.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| else { | ||
| // cuMemGetInfo_v2() always returns 201 (no context) | ||
| if (p_cuMemGetInfo) { | ||
| retval = (*p_cuMemGetInfo)(&memfree, &memtotal); |
There was a problem hiding this comment.
P1: Handle cuMemGetInfo*() failures before updating available RAM; otherwise an error now turns available_ram into 0 instead of preserving the fallback total memory value.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At client/gpu_nvidia.cpp, line 602:
<comment>Handle `cuMemGetInfo*()` failures before updating available RAM; otherwise an error now turns `available_ram` into 0 instead of preserving the fallback total memory value.</comment>
<file context>
@@ -598,19 +598,22 @@ static void get_available_nvidia_ram(COPROC_NVIDIA &cc, vector<string>& warnings
// cuMemGetInfo_v2() always returns 201 (no context)
- if (false && p_cuMemGetInfo_v2) {
+ if (p_cuMemGetInfo) {
+ retval = (*p_cuMemGetInfo)(&memfree, &memtotal);
+ snprintf(buf, sizeof(buf),
+ "cuMemGetInfo() returned %d; dev %d free %zu total %zu",
</file context>
I have nVidia GPU, I'll try to debug this. |
In particular, for a generic OpenCL device, log the calculation of its FLOPS.
Also: add timestamp to GPU detection log messages.
Also: for CUDA, use cuMemGetInfo() instead of cuMemGetInfo_v2().
On my computer, the latter always returns a 201 error
('not context', even though we just created a context).
Summary by cubic
Add timestamps to GPU detection logs and improve diagnostics for CUDA and OpenCL. Prefer CUDA cuMemGetInfo() with fallback and clearer logging to avoid context errors.
New Features
Bug Fixes
Written for commit b663abb. Summary will update on new commits.