Skip to content

Adapt pixel cpe algo to better handle broken clusters#47966

Closed
mroguljic wants to merge 5 commits intocms-sw:masterfrom
mroguljic:updated_cpe_algo
Closed

Adapt pixel cpe algo to better handle broken clusters#47966
mroguljic wants to merge 5 commits intocms-sw:masterfrom
mroguljic:updated_cpe_algo

Conversation

@mroguljic
Copy link
Copy Markdown
Contributor

@mroguljic mroguljic commented Apr 28, 2025

PR description:

At the end of 2024, cluster breakage at high eta became significant. This caused issues when trying to improve calibrations for pixel cluster (position) parameter estimation, CPE.

This fix changes the CPE algorithms by relying on the "good" cluster edge, instead of both edges. This alternative algorithm is only used if the clusters are shorter than expected (e.g. broken). The change affects template reconstruction, and generic reconstruction both at CPU and GPU (alpaka). The fix is gated behind process modifiers for testing. The new versions of CPE algorithms require corresponding condition updates.

A report on this has been given at the Tracker DPG meeting.

PR validation:

Undergoing runTheMatrix.py -l limited -i all --ibeos. We don't expect any workflow to be affect since the changes are protected by process modifiers. The PR was opened before the validation to allower others to comment early.

Backport
To be backported to 15.0.X: #48008

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Apr 28, 2025

cms-bot internal usage

@cmsbuild
Copy link
Copy Markdown
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47966/44634

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Copy Markdown
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47966/44635

@cmsbuild
Copy link
Copy Markdown
Contributor

Pull request #47966 was updated.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Apr 28, 2025

allow @mroguljic test rights

@mroguljic
Copy link
Copy Markdown
Contributor Author

@cmsbuild, please test

@cmsbuild
Copy link
Copy Markdown
Contributor

+1

Size: This PR adds an extra 88KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c4e35d/45755/summary.html
COMMIT: 369a8ac
CMSSW: CMSSW_15_1_X_2025-04-28-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47966/45755/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 101 lines from the logs
  • Reco comparison results: 13979 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3913297
  • DQMHistoTests: Total failures: 28156
  • DQMHistoTests: Total nulls: 5
  • DQMHistoTests: Total successes: 3885116
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.063 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 145.408 ): 0.020 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 145.5 ): -0.008 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 145.604 ): 0.051 KiB JetMET/SUSYDQM
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Copy Markdown
Contributor

@cmsbuild, please test

Once more

@cmsbuild
Copy link
Copy Markdown
Contributor

+1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c4e35d/46257/summary.html
COMMIT: 9d233f4
CMSSW: CMSSW_15_1_X_2025-05-20-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47966/46257/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

@makortel
Copy link
Copy Markdown
Contributor

Ok, between the previous two tests, the workflow 13034.0 shows in OfflinePV/Alignment

  • 25 differences wrt. baseline in the first test
  • 27 differences wrt. baseline in the second test

The two histograms showing differences wrt. baseline that are in the second test but not in the first test are
image
image

The two tests used different IB as a base (CMSSW_15_1_X_2025-05-19-1100 vs CMSSW_15_1_X_2025-05-20-1100), but both were run on the same CPU (16-core AMD EPYC-Genoa Processor).

So indeed this PR seems to introduce a non-reproducibility.

@ferencek
Copy link
Copy Markdown
Contributor

@makortel, I don't know how to do it and whether it is possible, but it would also be interesting to compare the baselines from the two tests. I was seeing differences there as well as reported here (across different machines, though).

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented May 21, 2025

@ferencek

I was seeing differences there as well as reported #47966 (comment) (across different machines, though).

There are some known non-reproducibilities in cmssw in wf 29634.911 (see issue #45505), so I would not pay too much attention to that. I would concentrate on the non-reproducibiities caused by this PR in the Run-3 workflows which are normally 100% reproducible (when run on the same arch).

Comment thread Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py Outdated
Comment thread RecoLocalTracker/Configuration/python/customizeHLT.py Outdated
Comment thread Configuration/PyReleaseValidation/README.md Outdated
Comment thread Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py Outdated
pixel cpe goodEdgeAlgo: simplified generic implementation and resolved wf collision

Co-authored-by: Dinko F. <Dinko.Ferencek@cern.ch>
@cmsbuild
Copy link
Copy Markdown
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47966/44891

@cmsbuild
Copy link
Copy Markdown
Contributor

@ferencek
Copy link
Copy Markdown
Contributor

@mmusich, thanks for the pointer. I was not aware of that.

Let me also mention that I discovered a plotting bug in the validation plots for the generic CPE applied to tracking RecHits posted earlier. The good edge curve actually corresponds to the template case and the effect of the new algorithm is in line with what we see for the template CPE, i.e., not much improvement. This is something we are now trying to understand.

@mroguljic
Copy link
Copy Markdown
Contributor Author

I just applied small changes based on code review, before further efforts on the wf differences. No need to test it yet.

@makortel
Copy link
Copy Markdown
Contributor

I don't know how to do it and whether it is possible

I'd suggest to start with valgrind, e.g.

valgrind --tool=memcheck $(cmsvgsupp) --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$ROOTSYS/etc/valgrind-root-python.supp --num-callers=20 --track-origins=yes cmsRun <config.py>

and post the (possibly large) log somewhere.

but it would also be interesting to compare the baselines from the two tests. I was seeing differences there as well as reported here (across different machines, though).

Small(?) differences between Intel and AMD are known, and (presumably) caused by some packages being compiled with -Ofast that enables fast but non-standard math functions. See e.g. #45576 and #40089.

@mroguljic
Copy link
Copy Markdown
Contributor Author

Could someone please run tests in this draft? It would help me towards understanding the workflow differences we see here.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented May 28, 2025

based on the results shown here can the Pixel DPG team clarify what's the prognosis for this PR?

@mroguljic
Copy link
Copy Markdown
Contributor Author

based on the results shown here can the Pixel DPG team clarify what's the prognosis for this PR?

During simulation testing of the algorithm proposed in this PR, based on the end of 2024, we observed unexpected results. Unexpected results were also seen with the current implementation of the generic CPE algorithm. We need to fully understand these anomalies before we can reliably validate the proposed changes. Intensive work is ongoing and we will follow up in the PR once things are understood.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Jun 19, 2025

-hlt

@mroguljic
Copy link
Copy Markdown
Contributor Author

Closing the PR because it is superseeded by #48356

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.