Update ROCm to version 7.1.0 #10181
|
enable gpu |
|
A new Pull Request was created by @fwyzard for branch IB/CMSSW_16_0_X/master. @akritkbehera, @cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks. |
|
cms-bot internal usage |
|
please test |
|
please test for el9_amd64_gcc13 |
|
-1
Failed Tests: RelVals-AMD_MI300X
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
You can see more details here:
Failed RelVals-AMD_MI300X
The relvals timed out after 4 hours.
Comparison Summary:
NVIDIA_H100 Comparison Summary:
NVIDIA_L40S Comparison Summary:
NVIDIA_T4 Comparison Summary:
|
|
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49307/summary.html
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
You can see more details here:
Comparison Summary:
|
|
-1
Failed Tests: RelVals-AMD_W7900
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
You can see more details here:
Failed RelVals-AMD_W7900
The relvals timed out after 4 hours.
Comparison Summary:
AMD_MI300X Comparison Summary:
NVIDIA_H100 Comparison Summary:
NVIDIA_L40S Comparison Summary:
NVIDIA_T4 Comparison Summary:
|
|
@fwyzard, with ROCm 7.1.0 the relval job is taking too much time on the AMD W7900. Normal PR relvals take 20-25 minutes, but with ROCm 7.1.0 they timed out after 4 hours. |
|
test parameters:
|
|
please test |
|
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49347/summary.html
Comparison Summary:
AMD_W7900 Comparison Summary:
|
|
@fwyzard, this looks good. Let me know if you want to run some local tests before we integrate it. |
|
please test for CMSSW_16_0_ROOT636_X/el10_amd64_gcc14 |
|
please test for el9_amd64_gcc13 to refresh the build |
f0aa00e to c8b2e4e
|
Pull request #10181 was updated. |
|
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49620/summary.html
Comparison Summary:
|
|
please test just to refresh the tests |
|
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49671/summary.html
Comparison Summary:
AMD_W7900 Comparison Summary:
|
|
@fwyzard, let us know when you are done with your tests. |
|
Comparison of HLT workflows using ROCm versions 7.0.2 and 7.1.0 running on the NGT MI300X GPUs, while pinning 15 cores (pdf version):
The "full" workflow refers to running the full HLT, offloading around 35% to GPUs, while the ECAL and pixel workflows refer to the corresponding GPU-only parts of the HLT. As an example, this is the command run for the full workflow:

./patatrack-scripts/scan hlt.py \
-e 10300 \
--event-resolution 10 \
--event-skip 300 \
-r 4 \
--wait 10 \
-j 1 \
--steps 1 2 4 8 16 24 32 \
-s 0 \
--slot cpus=1-15:amd=0 \
--csv scan/hlt.csv \
-l logs |& tee logs/benchmark_hlt.log

The full scan was observed to crash occasionally, but measurements were still possible given the 4 repetitions. Examples of the crashes:

Running 4 times over 10300 events with 1 jobs, each with 24 threads, 0 streams, and 1 GPUs
51.9 ± 0.0 ev/s (10000 events)
51.4 ± 0.0 ev/s (10000 events)
The underlying cmsRun job was killed by signal 6
The last lines of the error log are:
Module: EcalUncalibRecHitProducerPortable@alpaka:hltEcalUncalibRecHitSoA
Module: EcalUncalibRecHitProducerPortable@alpaka:hltEcalUncalibRecHitSoA
Module: EcalUncalibRecHitProducerPortable@alpaka:hltEcalUncalibRecHitSoA
Module: none
Module: EcalUncalibRecHitProducerPortable@alpaka:hltEcalUncalibRecHitSoA
Module: none
Module: none
Module: none

Running 4 times over 10300 events with 1 jobs, each with 2 threads, 0 streams, and 1 GPUs
10.0 ± 0.0 ev/s (10000 events)
The underlying cmsRun job was killed by signal
The last lines of the error log are:
The following is the call stack containing the origin of the signal.
Module: non-CMSSW (crashed)
Module: HcalDigisSoAProducer@alpaka:hltHcalDigisSoA
Module: none
A fatal system signal has occurred: abort signal

The latter is identical to the error messages received when running the HCAL-only workflow.
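
For anyone who wants to tabulate logs like the excerpts above, here is a minimal Python sketch. It assumes the scan output keeps the three line formats quoted in this comment (the "Running N times ..." header, the "ev/s" measurement line, and the "killed by signal" message); the real output may contain other line types, which are simply ignored. The helper name `summarize_scan` is hypothetical and is not part of patatrack-scripts.

```python
import re

# Patterns inferred from the log excerpts quoted in this thread (an assumption,
# not a documented format). search() is used for the header so that it is still
# found when it is glued to the end of a preceding line.
HEADER = re.compile(
    r"Running (\d+) times over (\d+) events with (\d+) jobs, "
    r"each with (\d+) threads, (\d+) streams, and (\d+) GPUs"
)
RATE = re.compile(r"^\s*([\d.]+) ± ([\d.]+) ev/s \((\d+) events\)")
CRASH = re.compile(r"The underlying cmsRun job was killed by signal")


def summarize_scan(log_text):
    """Group throughput measurements and crash counts by thread count."""
    results = {}
    current = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            threads = int(header.group(4))
            current = results.setdefault(threads, {"rates": [], "crashes": 0})
            continue
        if current is None:
            continue  # skip anything before the first header
        rate = RATE.match(line)
        if rate:
            current["rates"].append(float(rate.group(1)))
        elif CRASH.search(line):
            current["crashes"] += 1
    return results


# Sample text copied from the crash examples above.
sample = """Running 4 times over 10300 events with 1 jobs, each with 24 threads, 0 streams, and 1 GPUs
 51.9 ± 0.0 ev/s (10000 events)
 51.4 ± 0.0 ev/s (10000 events)
The underlying cmsRun job was killed by signal 6
Running 4 times over 10300 events with 1 jobs, each with 2 threads, 0 streams, and 1 GPUs
 10.0 ± 0.0 ev/s (10000 events)
The underlying cmsRun job was killed by signal
"""

summary = summarize_scan(sample)
for threads, info in sorted(summary.items()):
    print(threads, "threads:", info["rates"], "ev/s,", info["crashes"], "crash(es)")
```

Grouping by thread count makes it easy to see which scan steps still have enough successful repetitions to quote a throughput.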
|
|
thanks @bfonta ! |
|
@smuzaffar I think we can merge the update |
|
@fwyzard, what about the crashes mentioned by @bfonta? Should we first try to integrate it in DEVEL IBs and enable GPU tests for DEVEL IBs to see if everything works? |
|
No, I think we can go ahead and merge. The HCAL-only workflow crashes with ROCm 6.3, 6.4, 7.0 and 7.1 ... |
|
+externals let's get this in IBs for 16.0.0.pre3 |
|
This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_16_0_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @ftenchini, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2) |
|
I've repeated the previous study only for ROCm
No crashes were observed. Conditions are identical to the previous ones:
I've checked whether the new driver version fixes the crashes observed when running the HCAL-only GPU workflow, but unfortunately that is not the case. |


See the ROCm 7.1 release notes for the changes since ROCm 7.0.x.