
CPU vs. GPU for LST in HLT and updates to the offline #49832

Closed

VourMa wants to merge 1 commit into cms-sw:master from SegmentLinking:CMSSW_16_0_0_pre3_serialSync

Conversation

@VourMa (Contributor) commented Jan 14, 2026

The goal of this PR is to introduce two HLT workflows to monitor the agreement between LST on CPU and LST on GPU:

  1. Workflow 0.7541 monitors the LST output tracks when LST is used for track building (most direct comparison of LST), i.e. for alpakaValidationLST,singleIterPatatrack,trackingLST.
  2. Workflow 0.7573 monitors the built tracks in the upcoming new tracking baseline, where LST is used as an extended seeding algorithm (comparison of LST output in a "production" configuration), i.e. for singleIterPatatrack,phase2CAExtension,trackingLST,seedingLST,trackingMkFitCommon,hltTrackingMkFitInitialStep.

The additional CPU reconstruction (SerialSync) and the comparison plots are implemented via a new procModifier, alpakaValidationLST. This procModifier takes effect only in the procModifier combinations mentioned above; otherwise it produces neither the additional products nor the comparison plots. It is also included in the alpakaValidation modifier chain.
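For illustration, a procModifier combination that is only active in conjunction with other modifiers is typically expressed in a _cff fragment with the `&` operator on `cms.Modifier` objects. This is a minimal sketch of that mechanism, assuming hypothetical analyzer and InputTag names (they are not the actual modules of this PR):

```python
import FWCore.ParameterSet.Config as cms
from Configuration.ProcessModifiers.alpakaValidationLST_cff import alpakaValidationLST
from Configuration.ProcessModifiers.trackingLST_cff import trackingLST

# Hypothetical DQM analyzer comparing GPU and SerialSync LST tracks;
# module label, plugin choice, and tags are illustrative placeholders.
lstTrackComparison = cms.EDAnalyzer("TrackToTrackComparisonHists",
    monitoredTrack = cms.InputTag("hltLSTTracks"),            # hypothetical
    referenceTrack = cms.InputTag("hltLSTTracks"),
)

# Gate on the AND of the two modifiers: alpakaValidationLST alone does
# nothing; only the combination activates the SerialSync comparison.
(alpakaValidationLST & trackingLST).toModify(
    lstTrackComparison, referenceTrack = "hltLSTTracksSerialSync"
)
```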

The analyzer that produces the comparison plots has been extended with a new parameter that allows skipping the luminosity and PU plots.
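Such a switch is usually a simple `cms.bool` in the analyzer's PSet that gates the booking of the luminosity/PU histograms. A sketch under that assumption (the parameter name is a hypothetical placeholder; the actual name in the PR may differ):

```python
import FWCore.ParameterSet.Config as cms

# Hypothetical fragment: the comparison analyzer with a flag that lets
# derived configurations skip the luminosity and pileup plots.
trackComparisonMonitor = cms.EDAnalyzer("TrackToTrackComparisonHists",
    doPlotsVsLumiAndPU = cms.bool(True),  # hypothetical parameter name
)

# A cloned instance for the HLT CPU-vs-GPU workflows, with those plots off.
hltTrackComparisonMonitor = trackComparisonMonitor.clone(
    doPlotsVsLumiAndPU = False
)
```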

With the introduction of the alpakaValidationLST modifier, the offline workflow testing LST on CPU vs. LST on GPU can be made explicit. The code is changed so that the heterogeneous workflow 0.712 (previously 0.704) runs the offline reconstruction without any additional CPU reconstruction, while a new workflow, 0.713, runs the comparison. Workflow 0.703 has also been renamed to 0.711. The workflow numbering changes are made so that the offline LST workflows follow the numbering conventions for Alpaka workflows.

Some screenshots of the content of the DQM file:
[Screenshot from 2026-01-07 19-38-56 and two further DQM images omitted.]

@cmsbuild (Contributor) commented Jan 14, 2026

cms-bot internal usage

@cmsbuild (Contributor)

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49832/47487

@cmsbuild (Contributor)

A new Pull Request was created by @VourMa for master.

It involves the following packages:

  • Configuration/EventContent (operations)
  • Configuration/ProcessModifiers (operations)
  • Configuration/PyReleaseValidation (pdmv)
  • DQM/TrackingMonitorClient (dqm)
  • DQM/TrackingMonitorSource (dqm)
  • HLTrigger/Configuration (hlt)
  • RecoTracker/IterativeTracking (reconstruction)
  • Validation/RecoTrack (dqm)

@AdrianoDee, @DickyChant, @Martin-Grunewald, @Moanwar, @antoniovagnerini, @cmsbuild, @ctarricone, @davidlange6, @fabiocos, @ftenchini, @gabrielmscampos, @jfernan2, @mandrenguyen, @miquork, @mmusich, @nothingface0, @rseidita, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @arossi83, @dgulhan, @elusian, @fabiocos, @felicepantaleo, @fioriNTU, @gpetruc, @idebruyn, @jandrea, @makortel, @missirol, @mmasciov, @mmusich, @mtosi, @richa2710, @rovere, @slomeo, @sroychow, @threus, @wmtford this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich (Contributor) commented Jan 14, 2026

assign heterogeneous

@cmsbuild (Contributor)

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this pull request/issue and eventually sign. Thanks.

numWFIB.extend([prefixDet+34.7521]) # HLTTiming75e33, ticl_v5, ticlv5TrackLinkingGNN
numWFIB.extend([prefixDet+34.753]) # HLTTiming75e33, alpaka,singleIterPatatrack
numWFIB.extend([prefixDet+34.754]) # HLTTiming75e33, alpaka,singleIterPatatrack,trackingLST
numWFIB.extend([prefixDet+34.7541]) # HLTTiming75e33, alpakaValidationLST,singleIterPatatrack,trackingLST
Contributor

Shouldn't this go rather in the gpu matrix? How do I test this from the bot with a GPU backend available?

Contributor Author

Oops, my bad. Should be fixed in the last push.

@cmsbuild (Contributor)

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49832/47496


@mmusich (Contributor) commented Jan 15, 2026

enable gpu

@mmusich (Contributor) commented Jan 15, 2026

test parameters:

  • enable = hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt
  • enable_tests = gpu
  • workflows_gpu = 34434.7041, 34434.7541, 34434.7573
  • relvals_opt = -w upgrade,standard
  • relvals_opt_gpu = -w upgrade,standard

@mmusich (Contributor) commented Jan 15, 2026

@cmsbuild, please test

@VourMa (Contributor, Author) commented Jan 21, 2026

I do not see where I might have missed a 0.704 workflow. If anyone has any suggestions, please let me know...

I think here

Oh, OK, thanks!
For my understanding, is this supposed to be hard-coded and not controlled by some subset of workflows from this repository?
In any case, I can make a PR to the bot repo as well, if that's the recommended way.

@mmusich (Contributor) commented Jan 21, 2026

For my understanding, is this supposed to be hard-coded and not controlled by some subset of workflows from this repository?

🤷‍♂️

@VourMa (Contributor, Author) commented Jan 22, 2026

In any case, I can make a PR to the bot repo as well, if that's the recommended way.

The relevant PR has been made: cms-sw/cms-bot#2663

@mmusich (Contributor) commented Jan 22, 2026

test parameters:

@mmusich (Contributor) commented Jan 22, 2026

@cmsbuild, please test

@cmsbuild (Contributor)

-1

Failed Tests: UnitTests RelVals-NVIDIA_L40S
Size: This PR adds an extra 16KB to the repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-371ebe/50817/summary.html
COMMIT: b807f98
CMSSW: CMSSW_16_1_X_2026-01-22-1100/el8_amd64_gcc13
Additional Tests: GPU,HLT_P2_INTEGRATION,HLT_P2_TIMING,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/49832/50817/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test RecoTrackerLSTCore-standalone-compilation had ERRORS

Failed RelVals-NVIDIA_L40S

  • 34634.713_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnlyAlpakaValidationLST/step2_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnlyAlpakaValidationLST.log
  • 34634.712_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnly/step2_TTbar_14TeV+Run4D121PU_lstOnGPUIters01TrackingOnly.log
  • 34634.403_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation/step2_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation.log
Expand to see more relval errors ...

Comparison Summary

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 73
  • DQMHistoTests: Total histograms compared: 4814076
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4814053
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 72 files compared)
  • Checked 293 log files, 250 edm output root files, 73 DQM output files
  • TriggerResults: no differences found

@VourMa (Contributor, Author) commented Jan 22, 2026

The failed RelVals are due to the recent, known error:

----- Begin Fatal Exception 22-Jan-2026 19:34:42 CET-----------------------
An exception of category 'OutOfBound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLTriggerFinalPath'
   [2] Prefetching for module TriggerSummaryProducerAOD/'hltTriggerSummaryAOD'
   [3] Prefetching for module L1HPSPFTauProducer/'l1tHPSPFTauProducer'
   [4] Prefetching for module L1TPFCandMultiMerger/'l1tLayer1'
   [5] Prefetching for module L1TCorrelatorLayer1Producer/'l1tLayer1HGCal'
   [6] Calling method for module HGCalBackendLayer2Producer/'l1tHGCalBackEndLayer2Producer'
Exception Message:
TC X1 = 0.0683642 out of the seeding histogram bounds 0.076 - 0.58
----- End Fatal Exception -------------------------------------------------

while the failed unit test is unrelated and fixed in #49895.

@nothingface0 (Contributor)

+dqm

@VourMa (Contributor, Author) commented Jan 28, 2026

A kind ping to the reviewers of this PR: are there any follow-ups? This is needed for upcoming developments, so I would appreciate any feedback so that it can be finalized.

@mmusich (Contributor) commented Jan 28, 2026

A kind ping to the reviewers of this PR: are there any follow-ups? This is needed for upcoming developments, so I would appreciate any feedback so that it can be finalized.

I am not entirely convinced of the proposed changes in the menu. I think a Run-3-like approach, in which we have a dedicated path (e.g. DQM_TrackerHeterogeneousReco) where we put both flavours of the modules, would be desirable.

@VourMa (Contributor, Author) commented Jan 28, 2026

I am not entirely convinced of the proposed changes in the menu. I think a Run-3-like approach, in which we have a dedicated path (e.g. DQM_TrackerHeterogeneousReco) where we put both flavours of the modules, would be desirable.

Thanks for the feedback. Nothing changes in the menu currently (in terms of paths; or maybe I misinterpreted your comment). The changes are made so that a workflow comparing LST on CPU vs. GPU with the actual HLT configuration can be added to the matrix and run for RelVals (and that target should be satisfied by the proposed changes).

If the alternative solution has extra advantages on top of that, please let me know.

@mmusich (Contributor) commented Jan 28, 2026

If the alternative solution has extra advantages on top of that, please let me know.

Yes, it should be much more self-contained.

Nothing changes in the menu currently (in terms of paths, or maybe I misinterpreted what your comment).

This is the point I don't like. I would not like to change the behaviour of the whole menu, but just selected, targeted modules in a given path.

@VourMa (Contributor, Author) commented Jan 28, 2026

If the alternative solution has extra advantages on top of that, please let me know.

Yes, it should be much more self-contained.

Nothing changes in the menu currently (in terms of paths, or maybe I misinterpreted what your comment).

This is the point I don't like. I would not like to change the behaviour of the whole menu, but just selected, targeted modules in a given path.

I see. Would it be satisfactory if a copy of MC_TRK_cfi.py were made, which would then be modified to run the two reconstructions (CPU and GPU) and compare them in the same way as proposed in this PR?

@mmusich (Contributor) commented Jan 28, 2026

Would it be satisfactory if a copy of MC_TRK_cfi.py was made, which would then be modified to run the two reconstructions (CPU & GPU) and then compare them in the same way as proposed in this PR?

I think so. Two points I would advise:

  • If the goal is entirely to make a GPU vs. CPU comparison, I would gate the path with process.hltBackend + process.hltStatusOnGPUFilter, where:

    process.hltBackend = cms.EDProducer( "AlpakaBackendProducer@alpaka" )
    process.hltStatusOnGPUFilter = cms.EDFilter( "AlpakaBackendFilter",
        producer = cms.InputTag( 'hltBackend', 'backend' ),
        backends = cms.vstring( 'CudaAsync', 'ROCmAsync' )
    )

    to avoid running the path at all if there isn't a GPU.

  • I would name the path DQM_something (to keep the same nomenclature as in Run3 and potentially inspire other groups to use the same mechanism).

@VourMa (Contributor, Author) commented Jan 28, 2026

Got it, thanks for the advice. Then my proposal is the following:

  • I close this PR;
  • I open another PR with the changes in the offline part only - these are well factorized and well motivated;
  • I open another PR for the HLT part after the tracking configuration has been simplified - we are reasonably close to an update of the tracking baseline, and that would greatly facilitate the work.

@mmusich (Contributor) commented Jan 28, 2026

Then my proposal is the following:

No objections to this plan. BTW, I think the mechanism for implementing CPU vs GPU comparisons in the phase-2 menu would be an excellent topic to discuss at an upcoming TSG/Upgrade meeting.
Thanks!

@VourMa (Contributor, Author) commented Jan 29, 2026

Superseded by #49984 for the offline part. The HLT part will follow after more urgent updates have been pushed.
