Enhancement - Fix power monitoring for ROCm 6.3#794
Conversation
There was a problem hiding this comment.
Pull request overview
Improves ROCm GPU power monitoring robustness by handling unsupported power fields returned by amdsmi_get_power_info() on newer hardware (e.g., MI300X) and normalizing power_limit units so monitoring doesn’t crash and reports correct watts.
Changes:
- Update ROCm
get_device_power()to fall back fromaverage_socket_powertocurrent_socket_powerwhen the former is unsupported ('N/A'). - Update ROCm
get_device_power_limit()to treatpower_limitas microwatts (when applicable) and convert to watts.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #794 +/- ##
==========================================
+ Coverage 85.70% 85.88% +0.17%
==========================================
Files 103 103
Lines 7907 7929 +22
==========================================
+ Hits 6777 6810 +33
+ Misses 1130 1119 -11
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Use numbers.Real instead of (int, float) so numeric types like numpy scalars and ctypes ints are accepted, while still rejecting bool. - Explicitly check for missing keys in amdsmi power_info and log a warning so monitoring failures are diagnosable instead of silently returning None. - Replace the magic threshold 100000 with named constants _AMDSMI_MICROWATTS_PER_WATT and _AMDSMI_MICROWATTS_THRESHOLD, with a clearer comment explaining the µW vs W detection heuristic. - Add unit tests covering: average_socket_power supported, fallback to current_socket_power on 'N/A', both unsupported -> None, missing keys -> None, power_limit µW->W conversion, watts passthrough, non-numeric power_limit -> None, and missing power_limit key -> None.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
tests/common/test_device_manager.py:31
test_nvidia_helper_utilsis only guarded by@decorator.cuda_test(SB_TEST_CUDA), but it will still run (and fail) on hosts without NVIDIA GPUs (e.g., AMD-only or CPU-only), becausedm.device_managerwill not be aNvidiaDeviceManagerand methods likeget_device_compute_capability()returnNone. Gate this test on GPU vendor/availability (or split into CUDA vs ROCm tests using the existingrocm_testdecorator).
@decorator.cuda_test
@mock.patch('superbench.common.utils.process.run_command')
def test_nvidia_helper_utils(mock_run_command):
"""Test util functions of nvidia_helper."""
assert (isinstance(dm.device_manager.get_device_count(), numbers.Number))
assert (isinstance(dm.device_manager.get_device_compute_capability(), numbers.Number))
assert (isinstance(dm.device_manager.get_device_utilization(0), numbers.Number))
assert (isinstance(dm.device_manager.get_device_temperature(0), numbers.Number))
assert (isinstance(dm.device_manager.get_device_power_limit(0), numbers.Number))
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Source: - Extract _amdsmi_power_to_watts helper that applies the microwatts->watts heuristic uniformly. This makes get_device_power and get_device_power_limit symmetric: previously the limit was µW-corrected but the live power was not, so on amdsmi versions reporting power in µW the monitored gpu_power could be off from gpu_power_limit by a factor of 1e6. - Make AmdDeviceManager.__del__ defensive: swallow exceptions from amdsmi_shut_down so interpreter-shutdown / partial-import scenarios cannot raise (e.g., NameError when the module global has been torn down). This is the proper place for the fix that the test was previously working around via sys.modules injection. Tests: - Drop the sys.modules-level rocml injection (no longer needed once __del__ is safe). - Add test_amd_get_device_power_microwatts_converted to lock in the symmetric unit handling for get_device_power. - Add test_amd_device_manager_lifecycle exercising __init__/__del__ through normal construction and verifying __del__ tolerates a raising amdsmi_shut_down.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Resolves the three remaining unresolved Copilot inline review comments on PR #794: - 3177654095: module docstring at the top still said "nvidia_helper" even though this file now also covers the AmdDeviceManager path. Reword to mention both backends so the docstring matches the file content (and the previously-promised reply from 4184466765). - 3177654090 / 3236219521 (same issue, re-flagged on HEAD): there was only one blank line between test_amd_get_device_power_microwatts_converted and test_amd_get_device_power_limit_microwatts_converted, violating flake8 E302 (two blank lines expected between top-level defs). Insert one blank line to match every other inter-function gap in the file. Verification: python -m flake8 superbench/common/utils/device_manager.py tests/common/test_device_manager.py exits 0; all 10 AMD unit tests still pass (the pre-existing test_nvidia_helper_utils failure on CPU-only hosts is unrelated to this PR and pre-dates these changes).
Description
GPU monitor crashes with invalid literal for int() with base 10: 'N/A' — amdsmi_get_power_info() returns 'N/A' for average_socket_power on MI300X (hardware reports 0xFFFF for unsupported fields). The
code called int() on this value without validation. Additionally, power_limit is returned in microwatts but was treated as watts.
Changes
get_device_power(): Fall back to current_socket_power when average_socket_power is 'N/A'. get_device_power_limit(): Convert microwatts to watts when needed.