Adapt pixel cpe algo to better handle broken clusters by mroguljic · Pull Request #47966 · cms-sw/cmssw

mroguljic · 2025-04-28T14:42:05Z

PR description:

At the end of 2024, cluster breakage at high eta became significant. This caused issues when trying to improve calibrations for pixel cluster (position) parameter estimation (CPE).

This fix changes the CPE algorithms to rely on the "good" cluster edge instead of both edges. The alternative algorithm is used only if a cluster is shorter than expected (e.g. broken). The change affects template reconstruction and generic reconstruction on both CPU and GPU (alpaka). The fix is gated behind process modifiers for testing. The new versions of the CPE algorithms require corresponding condition updates.
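
As a rough illustration of the idea described above, here is a minimal one-dimensional sketch. It is not the CMSSW implementation: the function name `cluster_position`, its parameters, and the assumption that the caller already knows which edge is the "good" one are all hypothetical and chosen only for this example.

```python
def cluster_position(lower_edge, upper_edge, expected_length, good_edge=None):
    """Toy 1D cluster position estimate (hypothetical, not the CMSSW code).

    lower_edge, upper_edge: measured cluster edges along one coordinate.
    expected_length: cluster length expected from the track angle (assumed known).
    good_edge: "lower", "upper", or None if no edge is trusted more than the other.
    """
    observed_length = upper_edge - lower_edge

    # Default behaviour: use both edges when the cluster has the expected
    # length, or when we have no information about which edge is reliable.
    if good_edge is None or observed_length >= expected_length:
        return 0.5 * (lower_edge + upper_edge)

    # Cluster is shorter than expected (e.g. broken): extrapolate from the
    # trusted edge by half of the expected length.
    if good_edge == "lower":
        return lower_edge + 0.5 * expected_length
    if good_edge == "upper":
        return upper_edge - 0.5 * expected_length
    raise ValueError("good_edge must be 'lower', 'upper', or None")


# A cluster expected to span 0..4 that broke and only spans 0..2.
both_edges = cluster_position(0.0, 2.0, expected_length=4.0)
good_edge_only = cluster_position(0.0, 2.0, expected_length=4.0, good_edge="lower")
print(both_edges, good_edge_only)
```

In this toy case the both-edge estimate is pulled towards the surviving fragment, while the one-edge estimate recovers the centre of the unbroken cluster. How the actual algorithms decide which edge is "good" and what "shorter than expected" means is defined by the code in this PR, not by this sketch.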

A report on this has been given at the Tracker DPG meeting.

PR validation:

Undergoing runTheMatrix.py -l limited -i all --ibeos. We don't expect any workflow to be affected, since the changes are protected by process modifiers. The PR was opened before the validation to allow others to comment early.

Backport
To be backported to 15.0.X: #48008

cmsbuild · 2025-04-28T14:42:26Z

cms-bot internal usage

cmsbuild · 2025-04-28T14:43:50Z

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47966/44634

There are other open Pull requests which might conflict with changes you have proposed:
- File RecoLocalTracker/SiPixelRecHits/interface/pixelCPEforDevice.h modified in PR(s): Remove legacy CUDA modules for pixel track and vertex reconstruction #45853, A More Flexible And Lightweight CA #47611
- File RecoLocalTracker/SiPixelRecHits/plugins/alpaka/PixelCPEFastParamsESProducerAlpaka.cc modified in PR(s): Use TkPixelCPERecord for PixelCPEFastParams* #46852, A More Flexible And Lightweight CA #47611
- File RecoLocalTracker/SiPixelRecHits/src/PixelCPEFastParamsHost.cc modified in PR(s): CA Extension to strips #47090, A More Flexible And Lightweight CA #47611

Code check has found code style and quality issues which could be resolved by applying following patch(s)

code-format:
https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47966/44634/code-format.patch
e.g. curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47966/44634/code-format.patch | patch -p1
You can also run scram build code-format to apply code format directly

cmsbuild · 2025-04-28T14:48:26Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47966/44635

There are other open Pull requests which might conflict with changes you have proposed:
- File RecoLocalTracker/SiPixelRecHits/interface/pixelCPEforDevice.h modified in PR(s): Remove legacy CUDA modules for pixel track and vertex reconstruction #45853, A More Flexible And Lightweight CA #47611
- File RecoLocalTracker/SiPixelRecHits/plugins/alpaka/PixelCPEFastParamsESProducerAlpaka.cc modified in PR(s): Use TkPixelCPERecord for PixelCPEFastParams* #46852, A More Flexible And Lightweight CA #47611
- File RecoLocalTracker/SiPixelRecHits/src/PixelCPEFastParamsHost.cc modified in PR(s): CA Extension to strips #47090, A More Flexible And Lightweight CA #47611

cmsbuild · 2025-04-28T14:51:59Z

Pull request #47966 was updated.

mmusich · 2025-04-28T15:01:24Z

allow @mroguljic test rights

mroguljic · 2025-04-28T15:16:51Z

@cmsbuild, please test

cmsbuild · 2025-04-28T18:57:59Z

+1

Size: This PR adds an extra 88KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c4e35d/45755/summary.html
COMMIT: 369a8ac
CMSSW: CMSSW_15_1_X_2025-04-28-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47966/45755/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 101 lines from the logs
Reco comparison results: 13979 differences found in the comparisons
DQMHistoTests: Total files compared: 50
DQMHistoTests: Total histograms compared: 3913297
DQMHistoTests: Total failures: 28156
DQMHistoTests: Total nulls: 5
DQMHistoTests: Total successes: 3885116
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.063 KiB( 49 files compared)
DQMHistoSizes: changed ( 145.408 ): 0.020 KiB JetMET/SUSYDQM
DQMHistoSizes: changed ( 145.5 ): -0.008 KiB JetMET/SUSYDQM
DQMHistoSizes: changed ( 145.604 ): 0.051 KiB JetMET/SUSYDQM
Checked 215 log files, 184 edm output root files, 50 DQM output files
TriggerResults: no differences found

makortel · 2025-05-20T13:35:48Z

@cmsbuild, please test

Once more

cmsbuild · 2025-05-20T16:32:27Z

+1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c4e35d/46257/summary.html
COMMIT: 9d233f4
CMSSW: CMSSW_15_1_X_2025-05-20-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47966/46257/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 27 lines from the logs
Reco comparison results: 4103 differences found in the comparisons
DQMHistoTests: Total files compared: 50
DQMHistoTests: Total histograms compared: 4038193
DQMHistoTests: Total failures: 16737
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4021436
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
Checked 215 log files, 184 edm output root files, 50 DQM output files
TriggerResults: no differences found

CUDA Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 1
DQMHistoTests: Total histograms compared: 0
DQMHistoTests: Total failures: 0
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 0
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
Checked 0 log files, 0 edm output root files, 1 DQM output files

ROCM Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 1
DQMHistoTests: Total histograms compared: 0
DQMHistoTests: Total failures: 0
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 0
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
Checked 0 log files, 0 edm output root files, 1 DQM output files

makortel · 2025-05-20T17:44:44Z

Ok, between the previous two tests, the workflow 13034.0 shows in OfflinePV/Alignment

- 25 differences wrt. baseline in the first test
- 27 differences wrt. baseline in the second test

The two histograms showing differences wrt. baseline that are in the second test but not in the first test are

The two tests used different IB as a base (CMSSW_15_1_X_2025-05-19-1100 vs CMSSW_15_1_X_2025-05-20-1100), but both were run on the same CPU (16-core AMD EPYC-Genoa Processor).

So indeed this PR seems to introduce a non-reproducibility.
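
The comparison described above (which histograms differ from the baseline in the second test but not in the first) amounts to a set difference. A minimal sketch, assuming the names of the differing histograms from each test are available as plain lists; the helper `only_in_second` and the placeholder names `hist_00` ... `hist_26` are hypothetical, and the example additionally assumes that every difference from the first test also appears in the second:

```python
def only_in_second(first_test_diffs, second_test_diffs):
    """Return histogram names that differ from baseline in the second test only."""
    return sorted(set(second_test_diffs) - set(first_test_diffs))


# Placeholder names only; the real histogram names are not reproduced here.
first = [f"hist_{i:02d}" for i in range(25)]   # 25 differences in the first test
second = first + ["hist_25", "hist_26"]        # 27 differences in the second test

new = only_in_second(first, second)
print(len(first), len(second), new)
```

If the set of differing histograms changes between two runs of the same code on the same CPU type, that points to a non-reproducibility rather than to a deterministic change.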

ferencek · 2025-05-21T08:17:28Z

@makortel, I don't know how to do it and whether it is possible, but it would also be interesting to compare the baselines from the two tests. I was seeing differences there as well as reported here (across different machines, though).

mmusich · 2025-05-21T08:26:13Z

@ferencek

> I was seeing differences there as well as reported #47966 (comment) (across different machines, though).

There are some known non-reproducibilities in cmssw in wf 29634.911 (see issue #45505), so I would not pay too much attention to that. I would concentrate on the non-reproducibilities caused by this PR in the Run-3 workflows, which are normally 100% reproducible (when run on the same arch).

cmsbuild · 2025-05-21T11:41:38Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47966/44891

There are other open Pull requests which might conflict with changes you have proposed:
- File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): CA Extension to strips #47090, TICL-barrel: run CLUE in the barrel calorimeters and first workflows #47859, [NGT] Introduce NanoAOD flavour for Phase 2 HLT #48091, Fixes for HGCal validation at HLT #48114
- File RecoLocalTracker/SiPixelRecHits/interface/pixelCPEforDevice.h modified in PR(s): Remove legacy CUDA modules for pixel track and vertex reconstruction #45853, A More Flexible And Lightweight CA #47611
- File RecoLocalTracker/SiPixelRecHits/plugins/alpaka/PixelCPEFastParamsESProducerAlpaka.cc modified in PR(s): Use TkPixelCPERecord for PixelCPEFastParams* #46852, A More Flexible And Lightweight CA #47611
- File RecoLocalTracker/SiPixelRecHits/python/PixelCPEESProducers_cff.py modified in PR(s): CA Extension to strips #47090
- File RecoLocalTracker/SiPixelRecHits/src/PixelCPEFastParamsHost.cc modified in PR(s): CA Extension to strips #47090, A More Flexible And Lightweight CA #47611

cmsbuild · 2025-05-21T11:42:05Z

Pull request #47966 was updated. @AdrianoDee, @Martin-Grunewald, @Moanwar, @antoniovilela, @atpathak, @cmsbuild, @davidlange6, @DickyChant, @fabiocos, @francescobrivio, @fwyzard, @jfernan2, @makortel, @mandrenguyen, @miquork, @mmusich, @perrotta, @rappoccio, @srimanob, @subirsarkar can you please check and sign again.

ferencek · 2025-05-21T11:42:26Z

@mmusich, thanks for the pointer. I was not aware of that.

Let me also mention that I discovered a plotting bug in the validation plots for the generic CPE applied to tracking RecHits posted earlier. The good edge curve actually corresponds to the template case and the effect of the new algorithm is in line with what we see for the template CPE, i.e., not much improvement. This is something we are now trying to understand.

mroguljic · 2025-05-21T11:42:28Z

I just applied small changes based on code review, before further efforts on the wf differences. No need to test it yet.

makortel · 2025-05-21T13:26:19Z

> I don't know how to do it and whether it is possible

I'd suggest starting with valgrind, e.g.

valgrind --tool=memcheck $(cmsvgsupp) --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$ROOTSYS/etc/valgrind-root-python.supp --num-callers=20 --track-origins=yes cmsRun <config.py>

and post the (possibly large) log somewhere.

> but it would also be interesting to compare the baselines from the two tests. I was seeing differences there as well as reported here (across different machines, though).

Small(?) differences between Intel and AMD are known, and (presumably) caused by some packages being compiled with -Ofast that enables fast but non-standard math functions. See e.g. #45576 and #40089.
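
To illustrate why such flags can change results at all: in GCC, -Ofast enables -ffast-math, which allows the compiler to reorder floating-point operations, and floating-point addition is not associative. The sketch below only demonstrates that general effect in Python. The function names are hypothetical, and whether reordering is the actual mechanism behind the Intel/AMD differences mentioned here is not established in this thread.

```python
def sum_left_first(a, b, c):
    """Add as (a + b) + c."""
    return (a + b) + c


def sum_right_first(a, b, c):
    """Add as a + (b + c), i.e. the same terms in a different grouping."""
    return a + (b + c)


left = sum_left_first(0.1, 0.2, 0.3)
right = sum_right_first(0.1, 0.2, 0.3)

# The two groupings are mathematically identical but round differently.
print(left == right)
print(abs(left - right))
```

The difference is at the level of the last bit of a double, but it can propagate through thresholds and iterative fits into visibly different histograms.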

mroguljic · 2025-05-22T11:18:48Z

Could someone please run tests in this draft? It would help me towards understanding the workflow differences we see here.

mmusich · 2025-05-28T07:22:40Z

based on the results shown here can the Pixel DPG team clarify what's the prognosis for this PR?

mroguljic · 2025-05-28T12:14:51Z

> based on the results shown here can the Pixel DPG team clarify what's the prognosis for this PR?

During simulation testing of the algorithm proposed in this PR, based on the end of 2024, we observed unexpected results. Unexpected results were also seen with the current implementation of the generic CPE algorithm. We need to fully understand these anomalies before we can reliably validate the proposed changes. Intensive work is ongoing and we will follow up in the PR once things are understood.

mmusich · 2025-06-19T07:00:42Z

-hlt

IIUC this PR is superseded by Adapt template pixel cpe algo to better handle shortened or broken clusters #48356

mroguljic · 2025-06-19T09:20:49Z

Closing the PR because it is superseded by #48356.

cmsbuild added this to the CMSSW_15_1_X milestone Apr 28, 2025

cmsbuild added reconstruction-pending db-pending pending-signatures tests-pending orp-pending code-checks-pending trk labels Apr 28, 2025

cmsbuild added code-checks-rejected and removed code-checks-pending labels Apr 28, 2025

mroguljic force-pushed the updated_cpe_algo branch from bab6a5b to 369a8ac Compare April 28, 2025 14:46

cmsbuild added code-checks-pending and removed code-checks-rejected labels Apr 28, 2025

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 28, 2025

cmsbuild added the allow-mroguljic label Apr 28, 2025

cmsbuild added tests-started and removed tests-pending labels Apr 28, 2025

cmsbuild added tests-approved and removed tests-started labels Apr 28, 2025

mroguljic force-pushed the updated_cpe_algo branch from 369a8ac to 1e286ad Compare May 5, 2025 08:31

cmsbuild removed tests-approved code-checks-approved labels May 5, 2025

mroguljic commented May 21, 2025

Comment thread Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py Outdated

mroguljic commented May 21, 2025

Comment thread RecoLocalTracker/Configuration/python/customizeHLT.py Outdated

mroguljic commented May 21, 2025

Comment thread Configuration/PyReleaseValidation/README.md Outdated

mroguljic commented May 21, 2025

Comment thread Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py Outdated

Apply suggestions from code review

61beee2

pixel cpe goodEdgeAlgo: simplified generic implementation and resolved wf collision Co-authored-by: Dinko F. <Dinko.Ferencek@cern.ch>

mroguljic mentioned this pull request May 22, 2025

[DRAFT] minimum changes to produce wf diffs seen in PR#47966 #48146

Closed

This was referenced May 22, 2025

Update FastSim 2024 Era, add 2025 workflow #48153

Merged

Phase2-hgx364 Make a new scenario Run4D121 using D116 + corrected RPC with the corresponding workflow 34434.0 #48165

Merged

This was referenced Jun 11, 2025

Phase2-hgx364P1 Update the workflow for V19 version of HGCal scenarios #48292

Merged

Customization for HLT doublet-recovery tracking iteration with mkFit #48316

Merged

mroguljic mentioned this pull request Jun 18, 2025

Adapt template pixel cpe algo to better handle shortened or broken clusters #48356

Merged

mroguljic mentioned this pull request Jun 24, 2025

[15.0.X] Adapt template pixel cpe algo to better handle shortened or broken clusters #48399

Merged

12 participants
