Update GPU architectures #10493
Conversation
enable gpu

please test

A new Pull Request was created by @fwyzard for branch IB/CMSSW_17_0_X/master. @akritkbehera, @iarspider, @raoatifshad, @smuzaffar can you please review it and eventually sign? Thanks.

cms-bot internal usage

assign heterogeneous
-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52733/summary.html

Failed External Build: I found a compilation error when building:

WARNING: Target pattern parsing failed.
ERROR: no such package '@rules_java//java': java.io.IOException: Error downloading [https://github.com/bazelbuild/rules_java/releases/download/5.3.5/rules_java-5.3.5.tar.gz] to /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc13/external/tensorflow-sources_x86-64-v2/2.17.0-8889fad576923fd9a5d67c315d6f4715/build/86a8fa780a46a4b96ff787d7a11dc98a/external/rules_java/temp2800132884451372752/rules_java-5.3.5.tar.gz: GET returned 504 Gateway Time-out
INFO: Elapsed time: 27.949s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.pL9mow (%build)
RPM build errors: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.pL9mow (%build)

* The action "build-install-external+tensorflow-sources_x86-64-v2+2.17.0-8889fad576923fd9a5d67c315d6f4715" was not completed successfully because the following dependencies could not complete:
The error looks like a transient issue, and is essentially unrelated to these changes.
please test

Pull request #10493 was updated.

please test
I know we need to do this eventually, but is there something specific that motivates dropping Volta at this moment? I see some number of compute-capability 7.0 GPUs in the global pool at the moment (number-wise actually fewer than 6.x). I suppose we should run this through O&C management (I can do that on Wednesday). Given the small number of resources compared to >= 7.5, I would not expect objections.

Nothing urgent: we said we would do it after the end of Run 3 data taking, and the last data-taking release should be 16.1.x. The NVIDIA part could be postponed further.
-1

Failed Tests: RelVals-AMD_W7900
The AMD_W7900 relvals timed out after 4 hours.

Comparison Summary:
- AMD_W7900
- NVIDIA_H100
- NVIDIA_L40S
No clear objections were raised in the O&C meeting, but the question of "why V100 needs to be dropped now" was raised. Apparently there are V100 resources that it "would be nice" to keep supported as long as feasible. One option would be to drop Pascal now, and Volta later when that is really necessary.
I don't mind keeping Volta. I'm confused, because a couple of months ago I was told that it was OK with O&C to drop Pascal and Volta. I suggested keeping them until the end of Run 3, and here we are 🤷🏻‍♂️. If we keep Volta, I would keep Pascal as well for the time being. We will need to drop both relatively soon anyway, once we move to CUDA 13.x.
Drop support for Pascal (sm 6.0), add support for Blackwell (sm 10.0 and 12.0).
Drop support for Instinct MI100 (gfx908) and Radeon Pro W6800 (gfx1030).
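The two description lines above can be summarized as a simple before/after support table. The sketch below is purely illustrative, not actual CMSSW or scram configuration; the "before" architecture sets are assumptions (only the dropped and added targets are taken from this PR):

```python
# Illustrative sketch of the architecture change in this PR.
# The "before" sets are hypothetical examples, not the real build lists.

# NVIDIA CUDA compute capabilities (illustrative subset)
CUDA_ARCHS_BEFORE = {"6.0", "7.0", "7.5", "8.0", "8.6", "8.9", "9.0"}
# Drop Pascal (sm 6.0), add Blackwell (sm 10.0 and 12.0)
CUDA_ARCHS_AFTER = (CUDA_ARCHS_BEFORE - {"6.0"}) | {"10.0", "12.0"}

# AMD ROCm gfx targets (illustrative subset)
ROCM_ARCHS_BEFORE = {"gfx908", "gfx90a", "gfx942", "gfx1030", "gfx1100"}
# Drop Instinct MI100 (gfx908) and Radeon Pro W6800 (gfx1030)
ROCM_ARCHS_AFTER = ROCM_ARCHS_BEFORE - {"gfx908", "gfx1030"}

def is_supported(arch: str) -> bool:
    """Return True if the given architecture is still built after this PR."""
    return arch in CUDA_ARCHS_AFTER or arch in ROCM_ARCHS_AFTER

print(is_supported("6.0"))     # Pascal: dropped -> False
print(is_supported("7.0"))     # Volta: kept (per the discussion) -> True
print(is_supported("10.0"))    # Blackwell: added -> True
print(is_supported("gfx908"))  # MI100: dropped -> False
```

Note that Volta (sm 7.0) stays in the supported set here, matching the outcome of the O&C discussion above.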
Force-pushed from 98c6653 to db1d090.
Pull request #10493 was updated.

enable gpu

please test
+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52982/summary.html

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic. You can see more details here:

Comparison Summary:
- AMD_MI300X (there are some workflows for which there are errors in the baseline)
- AMD_W7900
- NVIDIA_H100
- NVIDIA_L40S
+heterogeneous

+externals

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_17_0_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)