Update ROCm to version 7.1.0 #10181

Merged
smuzaffar merged 1 commit into cms-sw:IB/CMSSW_16_0_X/master from fwyzard:IB/CMSSW_16_0_X/master_rocm_710
Nov 27, 2025

Conversation

@fwyzard
Contributor

@fwyzard fwyzard commented Nov 6, 2025

See the ROCm 7.1 release notes for the changes since ROCm 7.0.x.

@fwyzard
Contributor Author

fwyzard commented Nov 6, 2025

enable gpu

@cmsbuild
Contributor

cmsbuild commented Nov 6, 2025

A new Pull Request was created by @fwyzard for branch IB/CMSSW_16_0_X/master.

@akritkbehera, @cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Nov 6, 2025

cms-bot internal usage

@fwyzard
Contributor Author

fwyzard commented Nov 6, 2025

please test

@fwyzard
Contributor Author

fwyzard commented Nov 6, 2025

please test for el9_amd64_gcc13

@cmsbuild
Contributor

cmsbuild commented Nov 7, 2025

-1

Failed Tests: RelVals-AMD_MI300X
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49306/summary.html
COMMIT: f0aa00e
CMSSW: CMSSW_16_0_X_2025-11-05-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10181/49306/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49306/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49306/git-merge-result

Failed RelVals-AMD_MI300X

The relvals timed out after 4 hours.

Comparison Summary

Summary:

  • You potentially removed 1 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 5 differences found in the comparisons
  • Reco comparison had 2 failed jobs
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3939953
  • DQMHistoTests: Total failures: 77
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3939856
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 218 log files, 188 edm output root files, 51 DQM output files
  • TriggerResults: no differences found
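The DQMHistoTests counters in a summary like the one above can be cross-checked mechanically. The following is a minimal sketch, not part of cms-bot: the helper names `parse_dqm_summary` and `is_consistent` are hypothetical, and it assumes that failures, nulls, successes and skipped together account for every compared histogram, which holds for the numbers quoted in this report.

```python
import re

# Counter lines copied from the comparison summary above.
SUMMARY = """\
DQMHistoTests: Total files compared: 51
DQMHistoTests: Total histograms compared: 3939953
DQMHistoTests: Total failures: 77
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3939856
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
"""


def parse_dqm_summary(text):
    """Map each 'Total <label>' counter to its integer value (labels lowercased)."""
    pattern = re.compile(r"DQMHistoTests: Total ([A-Za-z ]+): (\d+)")
    return {m.group(1).strip().lower(): int(m.group(2)) for m in pattern.finditer(text)}


def is_consistent(counts):
    """Assumption: the four outcome counters add up to the histograms compared."""
    outcomes = counts["failures"] + counts["nulls"] + counts["successes"] + counts["skipped"]
    return outcomes == counts["histograms compared"]


counts = parse_dqm_summary(SUMMARY)
failure_fraction = counts["failures"] / counts["histograms compared"]
print(is_consistent(counts), f"{failure_fraction:.2e}")
```

The same check can be pointed at the per-GPU summaries by pasting their counter lines into `SUMMARY`.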

NVIDIA_H100 Comparison Summary

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 233 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 147869
  • DQMHistoTests: Total failures: 23825
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 124033
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

  • You potentially added 4 lines to the logs
  • Reco comparison results: 253 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 147869
  • DQMHistoTests: Total failures: 23542
  • DQMHistoTests: Total nulls: 14
  • DQMHistoTests: Total successes: 124313
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

  • You potentially added 7 lines to the logs
  • Reco comparison results: 220 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 147869
  • DQMHistoTests: Total failures: 32740
  • DQMHistoTests: Total nulls: 13
  • DQMHistoTests: Total successes: 115116
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: found differences in 2 / 10 workflows

@cmsbuild
Contributor

cmsbuild commented Nov 7, 2025

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49307/summary.html
COMMIT: f0aa00e
CMSSW: CMSSW_16_0_X_2025-11-05-2300/el9_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10181/49307/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49307/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49307/git-merge-result

Comparison Summary

Summary:

  • You potentially added 190 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 68308 differences found in the comparisons
  • Reco comparison had 2 failed jobs
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3939953
  • DQMHistoTests: Total failures: 367692
  • DQMHistoTests: Total nulls: 330
  • DQMHistoTests: Total successes: 3571911
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -10.084 KiB( 50 files compared)
  • DQMHistoSizes: changed ( 10224.0 ): -0.544 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 13034.0 ): -7.492 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 17034.0 ): 3.184 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 2024.0040001 ): 0.012 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 250202.181 ): 0.293 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 25202.0 ): 0.063 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 7.3 ): -5.600 KiB SiStrip/MechanicalView
  • Checked 218 log files, 188 edm output root files, 51 DQM output files
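For the numbers in this report, the per-entry DQMHistoSizes changes add up to the reported "Histogram memory added" total. A short sketch of that arithmetic, using `Decimal` to avoid floating-point rounding noise (the variable names are illustrative only, and the summation rule is inferred from these values rather than from cms-bot documentation):

```python
from decimal import Decimal

# Per-entry changes in KiB, copied from the DQMHistoSizes lines above.
changes_kib = ["-0.544", "-7.492", "3.184", "0.012", "0.293", "0.063", "-5.600"]

# Reported "Histogram memory added" value in KiB.
reported_total = Decimal("-10.084")

total = sum(Decimal(value) for value in changes_kib)
print(total, total == reported_total)
```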

@cmsbuild
Contributor

cmsbuild commented Nov 7, 2025

-1

Failed Tests: RelVals-AMD_W7900
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49306/summary.html
COMMIT: f0aa00e
CMSSW: CMSSW_16_0_X_2025-11-05-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10181/49306/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49306/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49306/git-merge-result

Failed RelVals-AMD_W7900

The relvals timed out after 4 hours.

Comparison Summary

Summary:

  • You potentially removed 1 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 5 differences found in the comparisons
  • Reco comparison had 2 failed jobs
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3939953
  • DQMHistoTests: Total failures: 77
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3939856
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 218 log files, 188 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

  • You potentially removed 4 lines from the logs
  • Reco comparison results: 242 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 147869
  • DQMHistoTests: Total failures: 27348
  • DQMHistoTests: Total nulls: 9
  • DQMHistoTests: Total successes: 120512
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_H100 Comparison Summary

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 233 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 147869
  • DQMHistoTests: Total failures: 23825
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 124033
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_L40S Comparison Summary

Summary:

  • You potentially added 4 lines to the logs
  • Reco comparison results: 253 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 147869
  • DQMHistoTests: Total failures: 23542
  • DQMHistoTests: Total nulls: 14
  • DQMHistoTests: Total successes: 124313
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

  • You potentially added 7 lines to the logs
  • Reco comparison results: 220 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 147869
  • DQMHistoTests: Total failures: 32740
  • DQMHistoTests: Total nulls: 13
  • DQMHistoTests: Total successes: 115116
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: found differences in 2 / 10 workflows

@smuzaffar
Contributor

@fwyzard, with ROCm 7.1.0 the relval job is taking too long on the AMD W7900. Normal PR relvals take 20-25 minutes, but with ROCm 7.1.0 the job timed out after 4 hours.
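To put the timeout in perspective, the figures quoted in this comment give a lower bound on the slowdown. Since the job was killed at the 4-hour limit, the true factor is unknown; this is only a back-of-the-envelope sketch with illustrative variable names.

```python
# Figures quoted in the comment: typical relval duration and the timeout.
typical_minutes = (20, 25)
timeout_minutes = 4 * 60

# The job did not finish, so each ratio is only a lower bound on the slowdown.
lower_bounds = [timeout_minutes / t for t in typical_minutes]
print(f"at least {min(lower_bounds):.1f}x to {max(lower_bounds):.1f}x slower")
```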

@fwyzard
Contributor Author

fwyzard commented Nov 7, 2025

test parameters:

  • enable = gpu
  • gpu = amd_w7900

@fwyzard
Contributor Author

fwyzard commented Nov 7, 2025

please test

@cmsbuild
Contributor

cmsbuild commented Nov 8, 2025

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49347/summary.html
COMMIT: f0aa00e
CMSSW: CMSSW_16_0_X_2025-11-07-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_W7900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10181/49347/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 6 differences found in the comparisons
  • Reco comparison had 2 failed jobs
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3939953
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3939927
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 218 log files, 188 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

  • You potentially added 83 lines to the logs
  • Reco comparison results: 223 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 147869
  • DQMHistoTests: Total failures: 33538
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 114320
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 125189.87399999998 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 20864.979 KiB HLT/HeterogeneousComparisons
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor

@fwyzard, this looks good. Let me know if you want to run some local tests before we integrate it.

@smuzaffar
Contributor

please test for CMSSW_16_0_ROOT636_X/el10_amd64_gcc14

@fwyzard
Contributor Author

fwyzard commented Nov 23, 2025

please test for el9_amd64_gcc13

to refresh the build

@fwyzard fwyzard force-pushed the IB/CMSSW_16_0_X/master_rocm_710 branch from f0aa00e to c8b2e4e Compare November 23, 2025 07:47
@cmsbuild
Contributor

Pull request #10181 was updated.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49620/summary.html
COMMIT: c8b2e4e
CMSSW: CMSSW_16_0_X_2025-11-21-2300/el9_amd64_gcc13
Additional Tests: GPU,AMD_W7900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10181/49620/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 211 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 68178 differences found in the comparisons
  • Reco comparison had 2 failed jobs
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3905633
  • DQMHistoTests: Total failures: 369191
  • DQMHistoTests: Total nulls: 336
  • DQMHistoTests: Total successes: 3536086
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -10.084 KiB( 50 files compared)
  • DQMHistoSizes: changed ( 10224.0 ): -0.544 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 13034.0 ): -7.492 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 17034.0 ): 3.184 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 2024.0040001 ): 0.012 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 250202.181 ): 0.293 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 25202.0 ): 0.063 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 7.3 ): -5.600 KiB SiStrip/MechanicalView
  • Checked 218 log files, 188 edm output root files, 51 DQM output files

@smuzaffar
Contributor

please test

just to refresh the tests

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cfaec8/49671/summary.html
COMMIT: c8b2e4e
CMSSW: CMSSW_16_0_X_2025-11-25-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_W7900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/10181/49671/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 118 lines from the logs
  • Reco comparison results: 16 differences found in the comparisons
  • Reco comparison had 4 failed jobs
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4268381
  • DQMHistoTests: Total failures: 89
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4268272
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -550.812 KiB( 52 files compared)
  • DQMHistoSizes: changed ( 16834.0,... ): -137.703 KiB HLT/JetMET
  • Checked 227 log files, 198 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

  • You potentially added 3 lines to the logs
  • Reco comparison results: 234 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 148187
  • DQMHistoTests: Total failures: 35962
  • DQMHistoTests: Total nulls: 8
  • DQMHistoTests: Total successes: 112217
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 10 files compared)
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor

@fwyzard, let us know when you are done with your tests.

@bfonta

bfonta commented Nov 26, 2025

Comparison of HLT workflows using ROCm versions 7.0.2 and 7.1.0 running on the NGT MI300X GPUs, while pinning 15 cores (pdf version):

[Figure: performance_comparison]

The "full" workflow refers to running the full HLT, with around 35% offloaded to GPUs, while the ECAL and pixel workflows refer to the corresponding GPU-only parts of the HLT.
The HCAL workflow is not included because the GPU-only workflow crashes.

I include below as an example the command run for the full workflow:

./patatrack-scripts/scan hlt.py \
                         -e 10300 \
                         --event-resolution 10 \
                         --event-skip 300 \
                         -r 4 \
                         --wait 10 \
                         -j 1 \
                         --steps 1 2 4 8 16 24 32 \
                         -s 0 \
                         --slot cpus=1-15:amd=0 \
                         --csv scan/hlt.csv \
                         -l logs |& tee logs/benchmark_hlt.log

The full scan was observed to occasionally crash, but measurements were still possible given the 4 repetitions considered. Examples of the crashes:

Running 4 times over 10300 events with 1 jobs, each with 24 threads, 0 streams, and 1 GPUs
    51.9 ±   0.0 ev/s (10000 events)
    51.4 ±   0.0 ev/s (10000 events)
The underlying cmsRun job was killed by signal 6

The last lines of the error log are:
Module: EcalUncalibRecHitProducerPortable@alpaka:hltEcalUncalibRecHitSoA
Module: EcalUncalibRecHitProducerPortable@alpaka:hltEcalUncalibRecHitSoA
Module: EcalUncalibRecHitProducerPortable@alpaka:hltEcalUncalibRecHitSoA
Module: none
Module: EcalUncalibRecHitProducerPortable@alpaka:hltEcalUncalibRecHitSoA
Module: none
Module: none
Module: none

Running 4 times over 10300 events with 1 jobs, each with 2 threads, 0 streams, and 1 GPUs
    10.0 ±   0.0 ev/s (10000 events)

The underlying cmsRun job was killed by signal
The last lines of the error log are:
The following is the call stack containing the origin of the signal.

Module: non-CMSSW (crashed)

Module: HcalDigisSoAProducer@alpaka:hltHcalDigisSoA
Module: none
A fatal system signal has occurred: abort signal

The latter is identical to the error messages received when running the HCAL-only workflow.
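Logs in the format quoted above can be summarised with a few lines of Python. This is a hedged sketch rather than part of patatrack-scripts: the `summarize` helper and both regular expressions are assumptions based only on the log excerpts shown in this comment. It counts the repetitions that completed, averages their throughput, and records the signal of any killed job.

```python
import re
from statistics import mean

# Excerpt copied from the first crash example above.
LOG = """\
Running 4 times over 10300 events with 1 jobs, each with 24 threads, 0 streams, and 1 GPUs
    51.9 ±   0.0 ev/s (10000 events)
    51.4 ±   0.0 ev/s (10000 events)
The underlying cmsRun job was killed by signal 6
"""

# Throughput lines look like "51.9 ± 0.0 ev/s (10000 events)".
RATE = re.compile(r"([\d.]+)\s*±\s*([\d.]+)\s*ev/s\s*\((\d+) events\)")
# The signal number may be missing from the log, so it is optional here.
SIGNAL = re.compile(r"killed by signal\s*(\d*)")


def summarize(log):
    """Summarise completed repetitions and crashes found in a scan log."""
    rates = [float(m.group(1)) for m in RATE.finditer(log)]
    signals = [m.group(1) or "unknown" for m in SIGNAL.finditer(log)]
    return {
        "completed": len(rates),
        "mean_ev_per_s": mean(rates) if rates else None,
        "signals": signals,
    }


summary = summarize(LOG)
print(summary)
```

Because the signal number is optional in the pattern, the second crash example, where the log line ends right after "signal", would be reported as "unknown" instead of being dropped.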


  • architecture: el9_amd64_gcc13
  • release: CMSSW_16_0_X_2025-11-21-2300
  • HLT menu: /frozen/2025/2e34/v1.3/CMSSW_15_1_X/HLT/V6
  • global tag: 150X_dataRun3_HLT_v1
  • input data: /shared/store/data/Run2025E/EphemeralHLTPhysics/FED/run396102/run396102_ls0295_index*.raw

@fwyzard
Contributor Author

fwyzard commented Nov 26, 2025

thanks @bfonta !

@fwyzard
Contributor Author

fwyzard commented Nov 26, 2025

@smuzaffar I think we can merge the update

@smuzaffar
Contributor

@fwyzard, what about the crash mentioned by @bfonta? Should we first try to integrate it in the DEVEL IBs and enable GPU tests for the DEVEL IBs to see if everything works?

The underlying cmsRun job was killed by signal
The last lines of the error log are:
The following is the call stack containing the origin of the signal.

Module: non-CMSSW (crashed)

Module: HcalDigisSoAProducer@alpaka:hltHcalDigisSoA
Module: none
A fatal system signal has occurred: abort signal

@fwyzard
Contributor Author

fwyzard commented Nov 27, 2025

No, I think we can go ahead and merge.

The HCAL-only workflow crashes with ROCm 6.3, 6.4, 7.0 and 7.1 ...

@smuzaffar
Contributor

smuzaffar commented Nov 27, 2025

+externals

let's get this into the IBs for 16.0.0.pre3

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_16_0_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @ftenchini, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@smuzaffar smuzaffar merged commit d2ce1ad into cms-sw:IB/CMSSW_16_0_X/master Nov 27, 2025
19 checks passed
@bfonta

bfonta commented Nov 28, 2025

I've repeated the previous study only for ROCm 7.1.0, comparing two AMD GPU driver versions: 6.12.12 and 6.16.6 (pdf version):

[Figure: performance_newdriver]

No crashes were observed.

Conditions are identical to the previous ones:

  • architecture: el9_amd64_gcc13
  • release: CMSSW_16_0_X_2025-11-21-2300
  • HLT menu: /frozen/2025/2e34/v1.3/CMSSW_15_1_X/HLT/V6
  • global tag: 150X_dataRun3_HLT_v1
  • input data: /shared/store/data/Run2025E/EphemeralHLTPhysics/FED/run396102/run396102_ls0295_index*.raw

I've checked whether the new driver version fixes the crashes observed when running the HCAL-only GPU workflow, but unfortunately it does not.

@fwyzard fwyzard deleted the IB/CMSSW_16_0_X/master_rocm_710 branch November 28, 2025 14:38