
Implement GPU vs CPU comparison for HLT heterogeneous products in patatrack workflows #49105

Merged
cmsbuild merged 6 commits into cms-sw:master from mmusich:mm_hlt_gpu_vs_cpu_comparisons_in_offline_dqm
Oct 14, 2025

Conversation

@mmusich
Contributor

@mmusich mmusich commented Oct 8, 2025

PR description:

It has been discussed that it would be desirable to have a means of assessing the discrepancies in the heterogeneous reconstruction chains when run on CPU vs GPU backends at the release-validation level, by submitting some of the existing patatrack workflows on dedicated resources.
In all those workflows the HLT menu is run as part of step 2, and in PR #49079 we propagated the DQMGPUvsCPU stream event content (as it is run online) to HLTDebugRAW and HLTDebugFEVT for release-validation purposes.
This means we can finally profit from the existing DQM infrastructure (developed for online DQM) to generate such comparisons in relvals.
The goal of this PR is to provide the infrastructural changes to do so, while making sure not to crash the process in case some of the input collections are not available.

PR validation:

I have run the following workflow:

runTheMatrix.py --what gpu -l 17034.402 -t 4 -j 8 --ibeos

both on a machine equipped with an NVIDIA T4 GPU and on one without a GPU attached, and I was able to inspect the output comparison plots in the former case.
I have also run the subset of relval tests in the gpu matrix that are run in PR tests via:

runTheMatrix.py --job-reports -w gpu -l 17034.402,17034.403,17034.406,17034.412,17034.422,17034.423,29834.402,29834.403,29834.404,29834.704,29834.751 --ibeos

and did not observe issues.
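
For scripting such submissions, the first command above can be assembled programmatically. This is a minimal sketch that uses only the flags already shown in this description (`--what`, `-l`, `-t`, `-j`, `--ibeos`); the helper name `run_the_matrix_cmd` is hypothetical and not part of CMSSW, and the sketch only builds the argument list without running anything.

```python
import shlex


def run_the_matrix_cmd(workflows, what="gpu", threads=None, jobs=None, ibeos=True):
    """Build a runTheMatrix.py argument list (hypothetical helper, not a CMSSW API)."""
    cmd = ["runTheMatrix.py", "--what", what, "-l", ",".join(workflows)]
    if threads is not None:
        cmd += ["-t", str(threads)]  # value passed to -t, as in the command above
    if jobs is not None:
        cmd += ["-j", str(jobs)]  # value passed to -j, as in the command above
    if ibeos:
        cmd.append("--ibeos")
    return cmd


# Reproduces the single-workflow command quoted in this description.
cmd = run_the_matrix_cmd(["17034.402"], threads=4, jobs=8)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` inside a CMSSW environment; the longer workflow list from the second command can be passed as the `workflows` argument in the same way.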

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Not a backport, but I think it would be useful to backport at least to CMSSW_15_1_X.

Cc: @AdrianoDee @fwyzard @bainbrid @mtosi

@cmsbuild
Contributor

cmsbuild commented Oct 8, 2025

cms-bot internal usage

@mmusich
Contributor Author

mmusich commented Oct 8, 2025

enable gpu

@cmsbuild
Contributor

cmsbuild commented Oct 8, 2025

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49105/46346

@cmsbuild
Contributor

cmsbuild commented Oct 8, 2025

A new Pull Request was created by @mmusich for master.

It involves the following packages:

  • Configuration/PyReleaseValidation (pdmv, upgrade)
  • DQM/PFTasks (dqm)
  • DQMOffline/Configuration (dqm)
  • DQMOffline/Trigger (dqm)

@AdrianoDee, @DickyChant, @Moanwar, @antoniovagnerini, @cmsbuild, @ctarricone, @gabrielmscampos, @miquork, @nothingface0, @rseidita, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks.
@Fedespring, @HuguesBrun, @Martin-Grunewald, @cericeci, @fabiocos, @jhgoh, @makortel, @missirol, @mtosi, @rociovilar, @slomeo, @threus, @trocino this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented Oct 8, 2025

@cmsbuild please test

@cmsbuild
Contributor

cmsbuild commented Oct 8, 2025

+1

Size: This PR adds an extra 56KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a812fd/48549/summary.html
COMMIT: deb6bfd
CMSSW: CMSSW_16_0_X_2025-10-08-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49105/48549/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 1 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3940073
  • DQMHistoTests: Total failures: 439
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3939614
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 218 log files, 188 edm output root files, 51 DQM output files
  • TriggerResults: no differences found
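
When reading these summaries, it can help to check that the DQMHistoTests counters add up and to express the failures as a fraction. The sketch below parses the counter lines quoted above; the regular expression and the function name `parse_histo_tests` are illustrative assumptions, not part of cms-bot, and the sum rule (failures + nulls + successes + skipped = histograms compared) is only verified here against the numbers in this report.

```python
import re

# Counter lines copied from the baseline comparison summary above.
SUMMARY = """\
DQMHistoTests: Total files compared: 51
DQMHistoTests: Total histograms compared: 3940073
DQMHistoTests: Total failures: 439
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3939614
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
"""


def parse_histo_tests(text):
    """Return {counter name (lower case): integer value} for DQMHistoTests lines."""
    counts = {}
    for match in re.finditer(r"DQMHistoTests: Total ([\w ]+?): (\d+)", text):
        counts[match.group(1).lower()] = int(match.group(2))
    return counts


counts = parse_histo_tests(SUMMARY)
accounted = (counts["failures"] + counts["nulls"]
             + counts["successes"] + counts["skipped"])
failure_fraction = counts["failures"] / counts["histograms compared"]

print("counters add up:", accounted == counts["histograms compared"])
print(f"failure fraction: {failure_fraction:.2e}")
```

The same function can be applied to the per-backend summaries by pasting their counter lines into `SUMMARY`.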

AMD_MI300X Comparison Summary

Summary:

  • You potentially added 262 lines to the logs
  • Reco comparison results: 261 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 26537
  • DQMHistoTests: Total nulls: 8
  • DQMHistoTests: Total successes: 120076
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

  • You potentially added 251 lines to the logs
  • Reco comparison results: 271 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 28079
  • DQMHistoTests: Total nulls: 6
  • DQMHistoTests: Total successes: 118536
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: found differences in 1 / 10 workflows

NVIDIA_H100 Comparison Summary

Summary:

  • You potentially added 259 lines to the logs
  • Reco comparison results: 265 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 26993
  • DQMHistoTests: Total nulls: 9
  • DQMHistoTests: Total successes: 119619
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: found differences in 3 / 10 workflows

NVIDIA_L40S Comparison Summary

Summary:

  • You potentially added 272 lines to the logs
  • Reco comparison results: 206 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 31010
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 115600
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: found differences in 1 / 10 workflows

NVIDIA_T4 Comparison Summary

Summary:

  • You potentially added 254 lines to the logs
  • Reco comparison results: 251 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 27114
  • DQMHistoTests: Total nulls: 13
  • DQMHistoTests: Total successes: 119494
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

@nothingface0
Contributor

Quick question, I would expect histograms to be filled in the HLT workspace, for the root files produced by the GPU bin by bin comparisons, but I don't see anything, e.g. here: https://cern.ch/xsnvd. Is it simply because there are no comparison failures?

@mmusich
Contributor Author

mmusich commented Oct 9, 2025

I would expect histograms to be filled in the HLT workspace, for the root files produced by the GPU bin by bin comparisons, but I don't see anything, e.g. here: https://cern.ch/xsnvd. Is it simply because there are no comparison failures?

no, it's because the workflow you chose is for phase-2 and we do not yet produce any of those products in the phase-2 menu.
If you look at Run3 there are plenty of meaningful comparisons: https://cern.ch/j3kxs

@nothingface0
Contributor

no, it's because the workflow you chose is for phase-2

Ah got it, thanks!

@nothingface0
Contributor

+dqm

Comment thread DQMOffline/Trigger/python/HeterogeneousMonitoring_cff.py Outdated
@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49105/46404

@cmsbuild
Contributor

Pull request #49105 was updated. @AdrianoDee, @DickyChant, @Moanwar, @antoniovagnerini, @cmsbuild, @ctarricone, @gabrielmscampos, @miquork, @nothingface0, @rseidita, @srimanob, @subirsarkar can you please check and sign again.

@mmusich
Contributor Author

mmusich commented Oct 10, 2025

@cmsbuild, please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a812fd/48587/summary.html
COMMIT: 47852f1
CMSSW: CMSSW_16_0_X_2025-10-10-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49105/48587/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3940073
  • DQMHistoTests: Total failures: 116
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3939937
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 218 log files, 188 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

  • You potentially added 144 lines to the logs
  • Reco comparison results: 249 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 27617
  • DQMHistoTests: Total nulls: 10
  • DQMHistoTests: Total successes: 118994
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

  • You potentially added 166 lines to the logs
  • Reco comparison results: 249 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 30573
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 116037
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_H100 Comparison Summary

Summary:

  • You potentially added 156 lines to the logs
  • Reco comparison results: 264 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 26705
  • DQMHistoTests: Total nulls: 9
  • DQMHistoTests: Total successes: 119907
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: found differences in 3 / 10 workflows

NVIDIA_L40S Comparison Summary

Summary:

  • You potentially added 144 lines to the logs
  • Reco comparison results: 205 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 32255
  • DQMHistoTests: Total nulls: 10
  • DQMHistoTests: Total successes: 114356
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

  • You potentially added 145 lines to the logs
  • Reco comparison results: 237 differences found in the comparisons
  • DQMHistoTests: Total files compared: 11
  • DQMHistoTests: Total histograms compared: 146621
  • DQMHistoTests: Total failures: 27860
  • DQMHistoTests: Total nulls: 10
  • DQMHistoTests: Total successes: 118751
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
  • DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
  • DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
  • Checked 42 log files, 45 edm output root files, 11 DQM output files
  • TriggerResults: no differences found

@mmusich
Contributor Author

mmusich commented Oct 13, 2025

no, it's because the workflow you chose is for phase-2 and we do not yet produce any of those products in the phase-2 menu.

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a812fd/48587/summary.html

the changes in the phase-2 workflows have been removed.
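
As a rough cross-check of this, the "Histogram memory added" totals from the two test rounds can be divided by the summed size of the four folders listed per workflow. This is a back-of-the-envelope sketch under a loud assumption: that every affected workflow adds exactly the four folders quoted in the summaries (`HLT/HeterogeneousComparisons`, `HLT/HcalGPUComparisonTask`, `EcalBarrel/EBGpuTask`, `EcalEndcap/EEGpuTask`) with the same sizes. Reading the resulting ratios as numbers of affected workflows is an inference, not something the bot reports.

```python
# Folder sizes (KiB) as quoted in the per-backend DQMHistoSizes lines.
per_workflow_kib = 64240.060 + 16087.998 + 822.003 + 822.003

# "Histogram memory added" totals (KiB) from the first and second test rounds.
first_round_kib = 655776.5039999998
second_round_kib = 491832.3779999999

# If the assumption holds, each ratio approximates a number of affected workflows.
print(round(first_round_kib / per_workflow_kib, 3))
print(round(second_round_kib / per_workflow_kib, 3))
```

Under that assumption the two ratios differ by about two, which would be compatible with fewer workflows being affected after the update.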

@gabrielmscampos
Member

+dqm

@mmusich
Contributor Author

mmusich commented Oct 14, 2025

@cms-sw/upgrade-l2 @cms-sw/pdmv-l2 just a kind ping.

@Moanwar
Contributor

Moanwar commented Oct 14, 2025

+Upgrade

@AdrianoDee
Contributor

+pdmv

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @mandrenguyen, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit 34e452c into cms-sw:master Oct 14, 2025
25 checks passed
@mmusich mmusich deleted the mm_hlt_gpu_vs_cpu_comparisons_in_offline_dqm branch October 14, 2025 18:18
8 participants