Update GPU architectures #10493

Merged
smuzaffar merged 2 commits into cms-sw:IB/CMSSW_17_0_X/master from fwyzard:IB/CMSSW_17_0_X/master_update_GPU_archs on May 2, 2026

@fwyzard
Contributor

fwyzard commented Apr 18, 2026

NVIDIA CUDA

  • drop support for Pascal (sm 6.0);
  • add support for Blackwell (sm 10.0 and 12.0).
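
For illustration, the change above can be pictured as an edit to the list of compute capabilities handed to `nvcc`. This is only a sketch under stated assumptions: the starting list and the helper name `gencode_flags` are hypothetical and are not taken from this pull request's diff. Only the removal of 6.0 and the addition of 10.0 and 12.0 come from the description, and the flags use the common `-gencode arch=compute_XY,code=sm_XY` form.

```python
# Hypothetical sketch: this is not the actual cmsdist configuration.

def gencode_flags(capabilities):
    """Turn compute capabilities such as "7.5" into nvcc -gencode flags."""
    flags = []
    for cap in capabilities:
        major, minor = cap.split(".")
        arch = f"{major}{minor}"  # e.g. "10.0" -> "100"
        flags.append(f"-gencode arch=compute_{arch},code=sm_{arch}")
    return flags

# Assumed starting list, for illustration only.
cuda_before = ["6.0", "7.0", "7.5", "8.0", "8.9", "9.0"]

# Changes named in the description: drop Pascal, add Blackwell.
cuda_dropped = {"6.0"}
cuda_added = ["10.0", "12.0"]

cuda_after = [cap for cap in cuda_before if cap not in cuda_dropped] + cuda_added

if __name__ == "__main__":
    print("\n".join(gencode_flags(cuda_after)))
```

Keeping the capabilities as a plain list makes it easy to see which generations a build targets, whatever form the real configuration takes.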

AMD ROCm

  • drop support for Instinct MI100 (gfx908) and Radeon Pro W6800 (gfx1030).
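
The ROCm side can be sketched in the same way, as a filter over a list of `gfx` offload targets. The starting list and the helper name `offload_arch_flags` are assumptions made for illustration and do not come from this pull request; only the removal of gfx908 and gfx1030 is stated in the description. The flags use the `--offload-arch=` option accepted by `hipcc`.

```python
# Hypothetical sketch: this is not the actual cmsdist configuration.

def offload_arch_flags(targets):
    """Turn gfx target names into hipcc --offload-arch flags."""
    return [f"--offload-arch={target}" for target in targets]

# Assumed starting list, for illustration only.
rocm_before = ["gfx908", "gfx90a", "gfx942", "gfx1030", "gfx1100"]

# Targets named in the description as dropped: MI100 and Radeon Pro W6800.
rocm_dropped = {"gfx908", "gfx1030"}

rocm_after = [target for target in rocm_before if target not in rocm_dropped]

if __name__ == "__main__":
    print(" ".join(offload_arch_flags(rocm_after)))
```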

@fwyzard
Contributor Author

fwyzard commented Apr 18, 2026

enable gpu

@fwyzard
Contributor Author

fwyzard commented Apr 18, 2026

please test

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard for branch IB/CMSSW_17_0_X/master.

@akritkbehera, @iarspider, @raoatifshad, @smuzaffar can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Apr 18, 2026

cms-bot internal usage

@fwyzard
Contributor Author

fwyzard commented Apr 18, 2026

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52733/summary.html
COMMIT: d121459
CMSSW: CMSSW_17_0_X_2026-04-17-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/10493/52733/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed External Build

I found compilation error when building:

WARNING: Target pattern parsing failed.
ERROR: no such package '@rules_java//java': java.io.IOException: Error downloading [https://github.com/bazelbuild/rules_java/releases/download/5.3.5/rules_java-5.3.5.tar.gz] to /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc13/external/tensorflow-sources_x86-64-v2/2.17.0-8889fad576923fd9a5d67c315d6f4715/build/86a8fa780a46a4b96ff787d7a11dc98a/external/rules_java/temp2800132884451372752/rules_java-5.3.5.tar.gz: GET returned 504 Gateway Time-out
INFO: Elapsed time: 27.949s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.pL9mow (%build)

RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.pL9mow (%build)

* The action "build-install-external+tensorflow-sources_x86-64-v2+2.17.0-8889fad576923fd9a5d67c315d6f4715" was not completed successfully because The following dependencies could not complete:


@fwyzard
Contributor Author

fwyzard commented Apr 18, 2026

The error

WARNING: Download from https://github.com/bazelbuild/rules_java/releases/download/5.3.5/rules_java-5.3.5.tar.gz failed: class java.io.IOException GET returned 504 Gateway Time-out

looks like a transient issue, and pretty much unrelated to these changes.

@fwyzard
Contributor Author

fwyzard commented Apr 18, 2026

please test

@fwyzard
Contributor Author

fwyzard commented Apr 19, 2026

The RunInfo failures should be unrelated.

@cmsbuild
Contributor

Pull request #10493 was updated.

@fwyzard changed the title from "Update GPU architectures" to "Update GPU software and architectures" on Apr 20, 2026
@fwyzard
Contributor Author

fwyzard commented Apr 20, 2026

please test

@makortel
Contributor

I know we need to do this eventually, but at this moment is there something specific that motivates dropping Volta? I see some number of 7.0 GPUs in the global pool at the moment (number-wise actually less than 6.X). I suppose we should run this through O&C management (I can do that on Wednesday). Given the small number of resources compared to >= 7.5 I would not expect objections.

@fwyzard
Contributor Author

fwyzard commented Apr 20, 2026

Nothing urgent: we said we would do it after the end of Run 3 data taking, and the last data taking release should be 16.1.x.

The NVIDIA part could be further postponed.

@cmsbuild
Contributor

-1

Failed Tests: RelVals-AMD_W7900
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52756/summary.html
COMMIT: 5929fde
CMSSW: CMSSW_17_0_X_2026-04-19-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10493/52756/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed RelVals-AMD_W7900

The relvals timed out after 4 hours.

Comparison Summary

Summary:

  • You potentially added 13 lines to the logs
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4186813
  • DQMHistoTests: Total failures: 60
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4186733
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 60.999999999999964 KiB( 52 files compared)
  • DQMHistoSizes: changed ( 1000.0,... ): 1.525 KiB CSC/Summary
  • Checked 227 log files, 197 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

Summary:

@makortel
Contributor

No clear objections were raised in the O&C meeting, but a question on "why V100 needs to be dropped now" was raised. Apparently there are V100 resources that "would be nice" to be kept supported as long as feasible.

One option would be to drop Pascal now, and Volta later when that is really necessary.

@fwyzard
Contributor Author

fwyzard commented Apr 22, 2026

I don't mind keeping Volta.

I'm confused because a couple of months ago I was told that it was OK for O&C to drop Pascal and Volta. I suggested keeping them until the end of Run 3, and here we are 🤷🏻‍♂️

If we keep Volta, I would also keep Pascal for the time being. We will need to drop both relatively soon anyway, once we move to CUDA 13.x.

@fwyzard changed the title from "Update GPU software and architectures" to "Update GPU architectures" on Apr 30, 2026
fwyzard added 2 commits April 30, 2026 10:36
Drop support for Pascal (sm 6.0), add support for Blackwell (sm 10.0 and 12.0).
Drop support for Instinct MI100 (gfx908) and Radeon Pro W6800 (gfx1030).
@fwyzard force-pushed the IB/CMSSW_17_0_X/master_update_GPU_archs branch from 98c6653 to db1d090 on April 30, 2026 08:37
@cmsbuild
Contributor

Pull request #10493 was updated.

@fwyzard
Contributor Author

fwyzard commented Apr 30, 2026

enable gpu

@fwyzard
Contributor Author

fwyzard commented Apr 30, 2026

please test

@cmsbuild
Contributor

cmsbuild commented May 1, 2026

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52982/summary.html
COMMIT: db1d090
CMSSW: CMSSW_17_0_X_2026-04-30-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/10493/52982/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52982/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52982/git-merge-result

Comparison Summary

Summary:

  • You potentially removed 4 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4187168
  • DQMHistoTests: Total failures: 33
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4187115
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 197 edm output root files, 53 DQM output files
  • TriggerResults: found differences in 1 / 51 workflows

AMD_MI300X Comparison Summary

There are some workflows for which there are errors in the baseline:
34634.402 step 2
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

  • You potentially removed 69 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 284 differences found in the comparisons
  • DQMHistoTests: Total files compared: 12
  • DQMHistoTests: Total histograms compared: 200550
  • DQMHistoTests: Total failures: 33992
  • DQMHistoTests: Total nulls: 26
  • DQMHistoTests: Total successes: 166532
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 11 files compared)
  • Checked 47 log files, 48 edm output root files, 12 DQM output files
  • TriggerResults: no differences found
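
As a quick sanity check on the DQMHistoTests figures quoted in this report, the sketch below assumes that the total number of compared histograms equals failures + nulls + successes + skipped. That relation is an assumption: it holds for the numbers in this thread but is not documented here. The function name `tally` is hypothetical.

```python
def tally(total, failures, nulls, successes, skipped):
    """Return (counts add up to total, fraction of histograms that failed)."""
    consistent = failures + nulls + successes + skipped == total
    return consistent, failures / total

# Figures copied from the default and AMD_MI300X comparison summaries.
default_run = tally(4187168, failures=33, nulls=0, successes=4187115, skipped=20)
mi300x_run = tally(200550, failures=33992, nulls=26, successes=166532, skipped=0)

if __name__ == "__main__":
    for name, (consistent, fraction) in [("default", default_run),
                                         ("AMD_MI300X", mi300x_run)]:
        print(f"{name}: consistent={consistent}, failure fraction={fraction:.4%}")
```

The much larger failure fraction on AMD_MI300X should be read together with the note that the baseline itself has errors for workflow 34634.402.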

AMD_W7900 Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

Summary:

@fwyzard
Contributor Author

fwyzard commented May 2, 2026

+heterogeneous

@smuzaffar
Contributor

+externals

@cmsbuild
Contributor

cmsbuild commented May 2, 2026

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_17_0_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)

@smuzaffar merged commit 5d4ff25 into cms-sw:IB/CMSSW_17_0_X/master on May 2, 2026
20 checks passed

4 participants