Update GPU architectures by fwyzard · Pull Request #10493 · cms-sw/cmsdist

fwyzard · 2026-04-18T10:35:36Z

NVIDIA CUDA

drop support for Pascal (sm 6.0);
add support for Blackwell (sm 10.0 and 12.0).

AMD ROCm

drop support for Instinct MI100 (gfx908) and Radeon Pro W6800 (gfx1030).
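
As an illustration of what such an architecture update means on the build side, here is a minimal Python sketch that turns a list of CUDA compute capabilities into `nvcc` `-gencode` flags. Only the removal of sm 6.0 and the addition of sm 10.0 and 12.0 come from this PR; the starting list `before` and the helper name `gencode_flags` are assumptions made for the example and are not taken from cmsdist.

```python
def gencode_flags(capabilities):
    """Return nvcc -gencode flags for (major, minor) compute capabilities."""
    flags = []
    for major, minor in capabilities:
        arch = f"{major}{minor}"  # e.g. (10, 0) -> "100"
        flags.append(f"-gencode arch=compute_{arch},code=sm_{arch}")
    return flags


# Hypothetical list of architectures before the change (an assumption).
before = [(6, 0), (7, 0), (7, 5), (8, 0), (8, 9), (9, 0)]

# Drop Pascal (sm 6.0), add Blackwell (sm 10.0 and 12.0), as described above.
after = [cc for cc in before if cc != (6, 0)] + [(10, 0), (12, 0)]

for flag in gencode_flags(after):
    print(flag)
```

Each printed flag asks `nvcc` to generate native code for one architecture, so shortening the list reduces build time and binary size at the cost of dropping older GPUs.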

fwyzard · 2026-04-18T10:35:47Z

enable gpu

fwyzard · 2026-04-18T10:35:51Z

please test

cmsbuild · 2026-04-18T10:35:57Z

A new Pull Request was created by @fwyzard for branch IB/CMSSW_17_0_X/master.

@akritkbehera, @iarspider, @raoatifshad, @smuzaffar can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

cmsbuild · 2026-04-18T10:35:58Z

cms-bot internal usage

fwyzard · 2026-04-18T10:36:09Z

assign heterogeneous

cmsbuild · 2026-04-18T10:36:11Z

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2026-04-18T17:32:53Z

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52733/summary.html
COMMIT: d121459
CMSSW: CMSSW_17_0_X_2026-04-17-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/10493/52733/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed External Build

I found a compilation error when building:

WARNING: Target pattern parsing failed.
ERROR: no such package '@rules_java//java': java.io.IOException: Error downloading [https://github.com/bazelbuild/rules_java/releases/download/5.3.5/rules_java-5.3.5.tar.gz] to /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc13/external/tensorflow-sources_x86-64-v2/2.17.0-8889fad576923fd9a5d67c315d6f4715/build/86a8fa780a46a4b96ff787d7a11dc98a/external/rules_java/temp2800132884451372752/rules_java-5.3.5.tar.gz: GET returned 504 Gateway Time-out
INFO: Elapsed time: 27.949s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.pL9mow (%build)

RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.pL9mow (%build)

* The action "build-install-external+tensorflow-sources_x86-64-v2+2.17.0-8889fad576923fd9a5d67c315d6f4715" was not completed successfully because The following dependencies could not complete:

fwyzard · 2026-04-18T20:45:36Z

The error

WARNING: Download from https://github.com/bazelbuild/rules_java/releases/download/5.3.5/rules_java-5.3.5.tar.gz failed: class java.io.IOException GET returned 504 Gateway Time-out

looks like a transient issue, and pretty much unrelated to these changes.
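
In this thread the fix was simply to re-trigger the tests. For illustration only, a transient download error such as a 504 is commonly handled with a retry and exponential backoff; the sketch below shows that generic pattern in Python. The names `retry` and `flaky_download` are hypothetical, and nothing here describes how Bazel or cms-bot actually handle downloads.

```python
import time


def retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except IOError as error:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)


# Simulated download that fails twice with a 504 before succeeding.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("GET returned 504 Gateway Time-out")
    return "rules_java-5.3.5.tar.gz"


# A no-op sleep keeps the demonstration instantaneous.
print(retry(flaky_download, sleep=lambda seconds: None))
```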

fwyzard · 2026-04-18T20:45:39Z

please test

fwyzard · 2026-04-19T16:56:55Z

The RunInfo failures should be unrelated.

cmsbuild · 2026-04-20T04:40:26Z

Pull request #10493 was updated.

fwyzard · 2026-04-20T04:42:11Z

please test

makortel · 2026-04-20T13:43:59Z

I know we need to do this eventually, but is there something specific that motivates dropping Volta at this moment? I see some 7.0 GPUs in the global pool at the moment (actually fewer than 6.X). I suppose we should run this through O&C management (I can do that on Wednesday). Given the small number of resources compared to >= 7.5, I would not expect objections.

fwyzard · 2026-04-20T13:46:25Z

Nothing urgent: we said we would do it after the end of Run 3 data taking, and the last data taking release should be 16.1.x.

The NVIDIA part could be further postponed.

cmsbuild · 2026-04-21T00:03:51Z

-1

Failed Tests: RelVals-AMD_W7900
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52756/summary.html
COMMIT: 5929fde
CMSSW: CMSSW_17_0_X_2026-04-19-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10493/52756/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed RelVals-AMD_W7900

The relvals timed out after 4 hours.

Comparison Summary

Summary:

You potentially added 13 lines to the logs
Reco comparison results: 6 differences found in the comparisons
DQMHistoTests: Total files compared: 53
DQMHistoTests: Total histograms compared: 4186813
DQMHistoTests: Total failures: 60
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4186733
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 60.999999999999964 KiB( 52 files compared)
DQMHistoSizes: changed ( 1000.0,... ): 1.525 KiB CSC/Summary
Checked 227 log files, 197 edm output root files, 53 DQM output files
TriggerResults: no differences found

NVIDIA_H100 Comparison Summary

Summary:

You potentially removed 29 lines from the logs
Reco comparison results: 339 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216795
DQMHistoTests: Total failures: 35456
DQMHistoTests: Total nulls: 32
DQMHistoTests: Total successes: 181307
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

You potentially added 23 lines to the logs
Reco comparison results: 364 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216795
DQMHistoTests: Total failures: 32544
DQMHistoTests: Total nulls: 35
DQMHistoTests: Total successes: 184216
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

makortel · 2026-04-22T13:31:27Z

No clear objections were raised in the O&C meeting, but a question on "why V100 needs to be dropped now" was raised. Apparently there are V100 resources that "would be nice" to be kept supported as long as feasible.

One option would be to drop Pascal now, and Volta later when that is really necessary.

fwyzard · 2026-04-22T13:48:36Z

I don't mind keeping Volta.

I'm confused because a couple of months ago I was told that it was OK for O&C to drop Pascal and Volta. I suggested keeping them until the end of Run 3, and here we are 🤷🏻‍♂️

If we keep Volta, I would also keep Pascal for the time being. We will need to drop both relatively soon anyway, once we move to CUDA 13.x.

cmsbuild · 2026-04-30T08:37:53Z

Pull request #10493 was updated.

fwyzard · 2026-04-30T08:37:53Z

enable gpu

fwyzard · 2026-04-30T08:37:56Z

please test

cmsbuild · 2026-05-01T21:54:48Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52982/summary.html
COMMIT: db1d090
CMSSW: CMSSW_17_0_X_2026-04-30-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/10493/52982/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52982/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd38d3/52982/git-merge-result

Comparison Summary

Summary:

You potentially removed 4 lines from the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 9 differences found in the comparisons
DQMHistoTests: Total files compared: 53
DQMHistoTests: Total histograms compared: 4187168
DQMHistoTests: Total failures: 33
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4187115
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
Checked 227 log files, 197 edm output root files, 53 DQM output files
TriggerResults: found differences in 1 / 51 workflows

AMD_MI300X Comparison Summary

There are some workflows for which there are errors in the baseline:
34634.402 step 2
The results for the comparisons for these workflows could be incomplete
This most likely means that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

You potentially removed 69 lines from the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 284 differences found in the comparisons
DQMHistoTests: Total files compared: 12
DQMHistoTests: Total histograms compared: 200550
DQMHistoTests: Total failures: 33992
DQMHistoTests: Total nulls: 26
DQMHistoTests: Total successes: 166532
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 11 files compared)
Checked 47 log files, 48 edm output root files, 12 DQM output files
TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

You potentially added 67 lines to the logs
Reco comparison results: 322 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 40584
DQMHistoTests: Total nulls: 31
DQMHistoTests: Total successes: 175644
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 6 / 12 workflows

NVIDIA_H100 Comparison Summary

Summary:

You potentially removed 10 lines from the logs
Reco comparison results: 340 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 32943
DQMHistoTests: Total nulls: 35
DQMHistoTests: Total successes: 183281
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

You potentially removed 13 lines from the logs
Reco comparison results: 337 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 31457
DQMHistoTests: Total nulls: 26
DQMHistoTests: Total successes: 184776
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: no differences found
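
To read these summaries more easily, the `DQMHistoTests` lines can be parsed and turned into a failure fraction. The Python sketch below does this for the NVIDIA_L40S numbers just above; the line format is copied from the bot output, while the function name `parse_dqm_summary` is hypothetical. In the summaries shown in this thread, failures, nulls, successes and skipped add up to the total number of histograms compared, which the sketch uses as a consistency check.

```python
import re

# DQMHistoTests lines copied from the NVIDIA_L40S summary above.
SUMMARY = """\
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 31457
DQMHistoTests: Total nulls: 26
DQMHistoTests: Total successes: 184776
DQMHistoTests: Total skipped: 0
"""


def parse_dqm_summary(text):
    """Map each 'DQMHistoTests: Total <name>: <count>' line to name -> count."""
    pattern = re.compile(r"^DQMHistoTests: Total (.+?): (\d+)$", re.MULTILINE)
    return {name: int(count) for name, count in pattern.findall(text)}


counts = parse_dqm_summary(SUMMARY)
total = counts["histograms compared"]

# Consistency check: the four outcome categories should cover every histogram.
accounted = sum(counts[key] for key in ("failures", "nulls", "successes", "skipped"))
assert accounted == total

print(f"failure fraction: {counts['failures'] / total:.1%}")
```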

fwyzard · 2026-05-02T08:37:00Z

+heterogeneous

smuzaffar · 2026-05-02T21:35:45Z

+externals

cmsbuild · 2026-05-02T21:36:12Z

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_17_0_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)

cmsbuild added externals-pending pending-signatures tests-started orp-pending labels Apr 18, 2026

cmsbuild added the heterogeneous-pending label Apr 18, 2026

cmsbuild added tests-rejected and removed tests-started labels Apr 18, 2026

cmsbuild added tests-started and removed tests-rejected labels Apr 18, 2026

fwyzard changed the title ~~Update GPU architectures~~ Update GPU software and architectures Apr 20, 2026

fwyzard mentioned this pull request Apr 20, 2026

[amd] RelVals 29634.x timing out cms-sw/cmssw#49570 (Open)

cmsbuild added tests-rejected and removed tests-started labels Apr 21, 2026

cmsbuild added tests-rejected and removed tests-started labels Apr 30, 2026

fwyzard changed the title ~~Update GPU software and architectures~~ Update GPU architectures Apr 30, 2026

fwyzard added 2 commits April 30, 2026 10:36

Update supported CUDA architectures

0eb3531

Drop support for Pascal (sm 6.0), add support for Blackwell (sm 10.0 and 12.0).

Update supported ROCm architectures

db1d090

Drop support for Instinct MI100 (gfx908) and Radeon Pro W6800 (gfx1030).

fwyzard force-pushed the IB/CMSSW_17_0_X/master_update_GPU_archs branch from 98c6653 to db1d090 Compare April 30, 2026 08:37

cmsbuild added tests-pending and removed tests-rejected labels Apr 30, 2026

cmsbuild added tests-started and removed tests-pending labels Apr 30, 2026

This was referenced Apr 30, 2026

ONNXRuntime: update to version 1.25.1 #10516 (Merged)

Build C++ interface for torch extensions #10517 (Merged)

cmsbuild added tests-approved and removed tests-started labels May 1, 2026

cmsbuild added heterogeneous-approved and removed heterogeneous-pending labels May 2, 2026

cmsbuild added externals-approved fully-signed and removed externals-pending pending-signatures labels May 2, 2026

smuzaffar merged commit 5d4ff25 into cms-sw:IB/CMSSW_17_0_X/master May 2, 2026
20 checks passed

This was referenced May 3, 2026

[G4ADEPT] update AdePT specs to v0.3.3 #10476 (Open)

Test linking ROCM in scram.xml files #10513 (Merged)
