Implement GPU vs CPU comparison for HLT heterogeneous products in patatrack workflows by mmusich · Pull Request #49105 · cms-sw/cmssw

mmusich · 2025-10-08T14:39:18Z

PR description:

It has been discussed that it would be desirable to have a means of assessing, at release-validation level, the discrepancies between the heterogeneous reconstruction chains when run on CPU and GPU backends, by submitting some of the existing patatrack workflows on dedicated resources.
In all those workflows the HLT menu is run as part of step 2, and in PR #49079 we propagated the DQMGPUvsCPU stream event content (as it is run online) to the HLTDebugRAW and HLTDebugFEVT event contents for release validation purposes.
This means we can finally take advantage of the existing DQM infrastructure (developed for online DQM) to generate such comparisons in relvals.
The goal of this PR is to provide the infrastructural changes to do so, while making sure not to crash the process in case some of the input collections are not available.
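
For illustration only, the "do not crash on missing input" behaviour can be sketched in plain Python. This is not the code touched by this PR (the actual protections are in the DQM modules themselves); the function name `compare_collections`, its arguments and the tolerance are hypothetical and only show the pattern of skipping a comparison when one side is absent.

```python
from typing import Optional, Sequence


def compare_collections(
    cpu: Optional[Sequence[float]],
    gpu: Optional[Sequence[float]],
    tolerance: float = 1e-6,
) -> Optional[int]:
    """Return the number of mismatching entries, or None if the comparison is skipped.

    Hypothetical helper: `cpu` and `gpu` stand in for the same product
    reconstructed on the two backends; None models a missing collection.
    """
    # Protection: a missing input is not an error, the comparison is just skipped.
    if cpu is None or gpu is None:
        return None
    # Entries present on one side only are counted as mismatches.
    mismatches = abs(len(cpu) - len(gpu))
    # Compare the overlapping entries one by one.
    for c, g in zip(cpu, gpu):
        if abs(c - g) > tolerance:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print(compare_collections(None, [1.0, 2.0]))        # skipped comparison
    print(compare_collections([1.0, 2.0], [1.0, 2.5]))  # one differing entry
```

Returning `None` rather than raising keeps the "nothing to compare" case distinct from "compared and found no differences", which would return 0.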

PR validation:

I have run the following workflow:

runTheMatrix.py --what gpu -l 17034.402 -t 4 -j 8 --ibeos

both on a machine equipped with an NVIDIA T4 GPU and on one without a GPU attached, and I was able to inspect the output comparison plots in the former case.
I have also run the subset of relval tests in the gpu matrix that is run in PR tests via:

runTheMatrix.py --job-reports -w gpu -l 17034.402,17034.403,17034.406,17034.412,17034.422,17034.423,29834.402,29834.403,29834.404,29834.704,29834.751 --ibeos

and did not observe issues.
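
As a convenience, the two command lines above can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The helper `build_relval_command` is hypothetical; the flags are the ones that appear in the commands above (which use both the `--what gpu` and `-w gpu` spellings; the short form is used here).

```python
import shlex
from typing import List, Sequence


def build_relval_command(
    workflows: Sequence[str],
    threads: int = 0,
    jobs: int = 0,
    job_reports: bool = False,
) -> List[str]:
    """Build a runTheMatrix.py argument list for the gpu matrix (hypothetical helper)."""
    cmd = ["runTheMatrix.py"]
    if job_reports:
        cmd.append("--job-reports")
    # Select the gpu matrix and the comma-separated list of workflows.
    cmd += ["-w", "gpu", "-l", ",".join(workflows)]
    # -t and -j are only added when explicitly requested.
    if threads:
        cmd += ["-t", str(threads)]
    if jobs:
        cmd += ["-j", str(jobs)]
    cmd.append("--ibeos")
    return cmd


if __name__ == "__main__":
    # Single-workflow test, as in the first command above.
    print(shlex.join(build_relval_command(["17034.402"], threads=4, jobs=8)))
```

The resulting list can be passed directly to `subprocess.run` inside a CMSSW environment, which avoids shell quoting issues with long workflow lists.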

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Not a backport, but I think it would be useful to backport at least to CMSSW_15_1_X.

Cc: @AdrianoDee @fwyzard @bainbrid @mtosi

cmsbuild · 2025-10-08T14:39:46Z

cms-bot internal usage

mmusich · 2025-10-08T14:41:09Z

enable gpu

cmsbuild · 2025-10-08T14:41:50Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49105/46346

There are other open Pull requests which might conflict with changes you have proposed:
- File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): [NGT] Extension of CA Pixel Tracking to Phase 2 Outer Tracker barrel #48921
- File DQMOffline/Configuration/python/DQMOffline_cff.py modified in PR(s): Tau DQM miniAOD validation for DATA #44029, Update BTV offline DQM sequences #46838

cmsbuild · 2025-10-08T14:42:12Z

A new Pull Request was created by @mmusich for master.

It involves the following packages:

Configuration/PyReleaseValidation (pdmv, upgrade)
DQM/PFTasks (dqm)
DQMOffline/Configuration (dqm)
DQMOffline/Trigger (dqm)

@AdrianoDee, @DickyChant, @Moanwar, @antoniovagnerini, @cmsbuild, @ctarricone, @gabrielmscampos, @miquork, @nothingface0, @rseidita, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks.
@Fedespring, @HuguesBrun, @Martin-Grunewald, @cericeci, @fabiocos, @jhgoh, @makortel, @missirol, @mtosi, @rociovilar, @slomeo, @threus, @trocino this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

mmusich · 2025-10-08T14:55:24Z

@cmsbuild please test

cmsbuild · 2025-10-08T17:38:26Z

+1

Size: This PR adds an extra 56KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a812fd/48549/summary.html
COMMIT: deb6bfd
CMSSW: CMSSW_16_0_X_2025-10-08-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49105/48549/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially added 1 lines to the logs
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 51
DQMHistoTests: Total histograms compared: 3940073
DQMHistoTests: Total failures: 439
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3939614
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
Checked 218 log files, 188 edm output root files, 51 DQM output files
TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

You potentially added 262 lines to the logs
Reco comparison results: 261 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 26537
DQMHistoTests: Total nulls: 8
DQMHistoTests: Total successes: 120076
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

You potentially added 251 lines to the logs
Reco comparison results: 271 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 28079
DQMHistoTests: Total nulls: 6
DQMHistoTests: Total successes: 118536
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: found differences in 1 / 10 workflows

NVIDIA_H100 Comparison Summary

Summary:

You potentially added 259 lines to the logs
Reco comparison results: 265 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 26993
DQMHistoTests: Total nulls: 9
DQMHistoTests: Total successes: 119619
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: found differences in 3 / 10 workflows

NVIDIA_L40S Comparison Summary

Summary:

You potentially added 272 lines to the logs
Reco comparison results: 206 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 31010
DQMHistoTests: Total nulls: 11
DQMHistoTests: Total successes: 115600
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: found differences in 1 / 10 workflows

NVIDIA_T4 Comparison Summary

Summary:

You potentially added 254 lines to the logs
Reco comparison results: 251 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 27114
DQMHistoTests: Total nulls: 13
DQMHistoTests: Total successes: 119494
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 655776.5039999998 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found
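
To compare the per-backend summaries above more easily, the `DQMHistoTests` counters can be extracted with a few lines of Python. This is a minimal sketch and not part of the cms-bot tooling: the function names are hypothetical, and the sample text is copied from the AMD_MI300X summary above.

```python
import re
from typing import Dict

# Matches the counter lines of a summary, e.g. "DQMHistoTests: Total failures: 26537".
# "Total files compared" and "Total Missing objects" are deliberately not captured.
_PATTERN = re.compile(
    r"DQMHistoTests: Total (histograms compared|failures|nulls|successes|skipped): (\d+)"
)


def parse_dqm_summary(text: str) -> Dict[str, int]:
    """Return the DQMHistoTests counters found in a summary block."""
    return {key: int(value) for key, value in _PATTERN.findall(text)}


def failure_fraction(counts: Dict[str, int]) -> float:
    """Fraction of compared histograms that were flagged as failures."""
    return counts["failures"] / counts["histograms compared"]


SUMMARY = """\
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 26537
DQMHistoTests: Total nulls: 8
DQMHistoTests: Total successes: 120076
DQMHistoTests: Total skipped: 0
"""

counts = parse_dqm_summary(SUMMARY)
print(f"{failure_fraction(counts):.1%}")  # prints 18.1%
```

For this block the failures, nulls, successes and skipped counters add up to the total number of histograms compared (26537 + 8 + 120076 + 0 = 146621), which is a quick consistency check on the parsing.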

nothingface0 · 2025-10-09T09:14:14Z

Quick question, I would expect histograms to be filled in the HLT workspace, for the root files produced by the GPU bin by bin comparisons, but I don't see anything, e.g. here: https://cern.ch/xsnvd. Is it simply because there are no comparison failures?

mmusich · 2025-10-09T09:18:30Z

I would expect histograms to be filled in the HLT workspace, for the root files produced by the GPU bin by bin comparisons, but I don't see anything, e.g. here: https://cern.ch/xsnvd. Is it simply because there are no comparison failures?

no, it's because the workflow you chose is for phase-2 and we don't produce yet any of those products in the phase-2 menu.
If you look at Run3, there are plenty of meaningful comparisons: https://cern.ch/j3kxs

nothingface0 · 2025-10-09T09:20:07Z

no, it's because the workflow you chose is for phase-2

Ah got it, thanks!

nothingface0 · 2025-10-09T09:39:09Z

+dqm

cmsbuild · 2025-10-10T11:10:46Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49105/46404

There are other open Pull requests which might conflict with changes you have proposed:
- File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): [NGT] Extension of CA Pixel Tracking to Phase 2 Outer Tracker barrel #48921
- File DQMOffline/Configuration/python/DQMOffline_cff.py modified in PR(s): Tau DQM miniAOD validation for DATA #44029, Update BTV offline DQM sequences #46838

cmsbuild · 2025-10-10T11:11:09Z

Pull request #49105 was updated. @AdrianoDee, @DickyChant, @Moanwar, @antoniovagnerini, @cmsbuild, @ctarricone, @gabrielmscampos, @miquork, @nothingface0, @rseidita, @srimanob, @subirsarkar can you please check and sign again.

mmusich · 2025-10-10T11:19:42Z

@cmsbuild, please test

cmsbuild · 2025-10-11T05:40:59Z

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a812fd/48587/summary.html
COMMIT: 47852f1
CMSSW: CMSSW_16_0_X_2025-10-10-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/49105/48587/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 51
DQMHistoTests: Total histograms compared: 3940073
DQMHistoTests: Total failures: 116
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3939937
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
Checked 218 log files, 188 edm output root files, 51 DQM output files
TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

You potentially added 144 lines to the logs
Reco comparison results: 249 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 27617
DQMHistoTests: Total nulls: 10
DQMHistoTests: Total successes: 118994
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

You potentially added 166 lines to the logs
Reco comparison results: 249 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 30573
DQMHistoTests: Total nulls: 11
DQMHistoTests: Total successes: 116037
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found

NVIDIA_H100 Comparison Summary

Summary:

You potentially added 156 lines to the logs
Reco comparison results: 264 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 26705
DQMHistoTests: Total nulls: 9
DQMHistoTests: Total successes: 119907
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: found differences in 3 / 10 workflows

NVIDIA_L40S Comparison Summary

Summary:

You potentially added 144 lines to the logs
Reco comparison results: 205 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 32255
DQMHistoTests: Total nulls: 10
DQMHistoTests: Total successes: 114356
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found

NVIDIA_T4 Comparison Summary

Summary:

You potentially added 145 lines to the logs
Reco comparison results: 237 differences found in the comparisons
DQMHistoTests: Total files compared: 11
DQMHistoTests: Total histograms compared: 146621
DQMHistoTests: Total failures: 27860
DQMHistoTests: Total nulls: 10
DQMHistoTests: Total successes: 118751
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 491832.3779999999 KiB( 10 files compared)
DQMHistoSizes: changed ( 17034.402,... ): 64240.060 KiB HLT/HeterogeneousComparisons
DQMHistoSizes: changed ( 17034.402,... ): 16087.998 KiB HLT/HcalGPUComparisonTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalBarrel/EBGpuTask
DQMHistoSizes: changed ( 17034.402,... ): 822.003 KiB EcalEndcap/EEGpuTask
Checked 42 log files, 45 edm output root files, 11 DQM output files
TriggerResults: no differences found

mmusich · 2025-10-13T09:24:08Z

no, it's because the workflow you chose is for phase-2 and we don't produce yet any of those products in the phase-2 menu.

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a812fd/48587/summary.html

the changes in the phase-2 workflows have been removed.

gabrielmscampos · 2025-10-14T08:04:38Z

+dqm

mmusich · 2025-10-14T08:13:59Z

@cms-sw/upgrade-l2 @cms-sw/pdmv-l2 just a kind ping.

Moanwar · 2025-10-14T09:07:44Z

+Upgrade

AdrianoDee · 2025-10-14T14:24:17Z

+pdmv

cmsbuild · 2025-10-14T14:24:42Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @mandrenguyen, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)

mandrenguyen · 2025-10-14T18:00:01Z

+1

cmsbuild added this to the CMSSW_16_0_X milestone Oct 8, 2025

cmsbuild added dqm-pending pending-signatures tests-pending orp-pending pdmv-pending upgrade-pending code-checks-pending labels Oct 8, 2025

cmsbuild added code-checks-approved and removed code-checks-pending labels Oct 8, 2025

cmsbuild added tests-started and removed tests-pending labels Oct 8, 2025

cmsbuild added tests-approved and removed tests-started labels Oct 8, 2025

cmsbuild mentioned this pull request Oct 9, 2025

[NGT] Extension of CA Pixel Tracking to Phase 2 Outer Tracker barrel #48921

Merged

cmsbuild added dqm-approved and removed dqm-pending labels Oct 9, 2025

fwyzard reviewed Oct 9, 2025

Comment thread DQMOffline/Trigger/python/HeterogeneousMonitoring_cff.py Outdated

mmusich added 2 commits October 10, 2025 12:14

add an HLT heterogeneous monitoring sequence for offline DQM

4472ea9

add protections for missing input for PFHcalGPUComparisonTask

b7b413e

cmsbuild added code-checks-approved and removed code-checks-pending labels Oct 10, 2025

cmsbuild added tests-started and removed tests-pending labels Oct 10, 2025

cmsbuild added tests-approved and removed tests-started labels Oct 11, 2025

cmsbuild added dqm-approved and removed dqm-pending labels Oct 14, 2025

cmsbuild added upgrade-approved and removed upgrade-pending labels Oct 14, 2025

cmsbuild added fully-signed pdmv-approved and removed pending-signatures pdmv-pending labels Oct 14, 2025

cmsbuild added orp-approved and removed orp-pending labels Oct 14, 2025

cmsbuild merged commit 34e452c into cms-sw:master Oct 14, 2025
25 checks passed

mmusich deleted the mm_hlt_gpu_vs_cpu_comparisons_in_offline_dqm branch October 14, 2025 18:18

mmusich mentioned this pull request Nov 6, 2025

Implement GPU vs CPU comparison for HLT pixel tracking heterogeneous products in patatrack workflows #49340

Merged
