ECAL unpacker and ECAL multifit algorithm migration to alpaka #43257
|
type ecal |
|
enable gpu |
|
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43257/37641 ERROR: Build errors found during clang-tidy run. |
|
test parameters:
|
|
code-checks |
|
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43257/37642 ERROR: Build errors found during clang-tidy run. |
5aaf97d to 14c3c41
|
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43257/37645 ERROR: Build errors found during clang-tidy run. |
14c3c41 to 8cf598a
|
+Upgrade |
|
+pdmv |
|
Hi @cms-sw/dqm-l2 @cms-sw/simulation-l2 do you have any comments on this PR? |
|
urgent
|
|
@cms-sw/dqm-l2 can you please have a look and sign ASAP? |
|
+1 |
|
+1
|
|
merge |
|
Hello, we're getting this message while running the Any hints on why this may happen? Dimitris for DQM-DC |
what workflow are you running ? |
Not a workflow, but streamers from run 375631 (hi_run). |
something I can test ? |
that's strange. The unit test for this client was run in the last batch of tests (see https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c1276f/36979/unitTests/src/DQM/Integration/test/TestDQMOnlineClient-pixellumi_dqm_sourceclient/testing.log) and no error was reported. Can you try on top of a more recent IB? |
I had to change MessageLogger configurations in some DQM sourceclients to pass the tests, but I did not need to touch the pixellumi one. Would it work if the MessageLogger configuration were adjusted in the style of pixel_dqm_sourceclient-live_cfg.py? |
A difference I see in the logs you posted vs. the logs on the DQM machines is: cmssdt:
DQM machines:
And directly after the line above is the Fatal Exception message. Unsure if it's related to something. |
Ah, this made me realize that the code in cmssw/HeterogeneousCore/AlpakaCore/python/ProcessAcceleratorAlpaka.py (lines 66 to 68 in d0954a1) relies on Python machinery defined in FWCore.ParameterSet.MessageLogger, which is not available if one just defines process.MessageLogger = cms.Service("MessageLogger", ...). So all uses of that pattern should be changed to modify the existing process.MessageLogger (note that process.load(...) is not needed; MessageLogger is available from cms.Process() construction).
|
Ah... that's because
Could this be problematic for the HLT menu as it runs online ? I guess we can just add |
Not really. The framework calls the The error message in #43257 (comment) comes from MessageLogger's ParameterSet validation code. I was a little bit off in my earlier comment that not every replacement of It is only if the new (Makes me wonder if we should consider somehow preventing the "old configuration API". In general it would be complicated, or potentially break some configurations. But perhaps limiting to the cases of
Depends how the If the If there is
I believe that would work. |
|
Thanks @thomreis, the following patch seems to work:
diff --git a/DQM/Integration/python/clients/pixellumi_dqm_sourceclient-live_cfg.py b/DQM/Integration/python/clients/pixellumi_dqm_sourceclient-live_cfg.py
index 2e62f7a11c7..f76d76317cb 100644
--- a/DQM/Integration/python/clients/pixellumi_dqm_sourceclient-live_cfg.py
+++ b/DQM/Integration/python/clients/pixellumi_dqm_sourceclient-live_cfg.py
@@ -13,12 +13,10 @@ unitTest=False
 if 'unitTest=True' in sys.argv:
     unitTest=True
-process.MessageLogger = cms.Service("MessageLogger",
- debugModules = cms.untracked.vstring('siPixelDigis',
- 'sipixelEDAClient'),
- cout = cms.untracked.PSet(threshold = cms.untracked.string('ERROR')),
- destinations = cms.untracked.vstring('cout')
-)
+process.MessageLogger.debugModules = cms.untracked.vstring('siPixelDigis',
+ 'sipixelEDAClient')
+process.MessageLogger.cout = cms.untracked.PSet(threshold = cms.untracked.string('ERROR'))
+
#----------------------------
# Event Source
We will take a look at the other clients as well and make a PR for this. |
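The failure mode discussed above can be illustrated with a small toy model in plain Python. This is only a sketch and does not use CMSSW: ToyService, ToyMessageLogger, ToyProcess, enable_category and toy_accelerator_setup are all hypothetical names standing in for a generic service object, the specialised logger object that exists from process construction, and setup code that assumes the specialised object. In CMSSW the reported symptom surfaced in MessageLogger's ParameterSet validation; the AttributeError here is just the simplest way to show the same dependency.

```python
class ToyService:
    """Hypothetical stand-in for a generic service: a plain bag of parameters."""

    def __init__(self, type_name, **params):
        self.type_name = type_name
        self.params = dict(params)


class ToyMessageLogger(ToyService):
    """Hypothetical stand-in for the specialised logger created with the process."""

    def __init__(self, **params):
        super().__init__("MessageLogger", **params)

    def enable_category(self, name):
        # Extra machinery that only the specialised class provides.
        self.params.setdefault("categories", []).append(name)


class ToyProcess:
    def __init__(self):
        # The logger is available right after construction; nothing to load.
        self.MessageLogger = ToyMessageLogger()


def toy_accelerator_setup(process):
    # Setup code that relies on the specialised logger machinery.
    process.MessageLogger.enable_category("AlpakaService")


# Old pattern: replace the logger with a generic service object.
broken = ToyProcess()
broken.MessageLogger = ToyService("MessageLogger", destinations=["cout"])
try:
    toy_accelerator_setup(broken)
    broken_ok = True
except AttributeError:
    broken_ok = False  # the generic object lacks the extra machinery

# New pattern: modify the logger that already exists.
fixed = ToyProcess()
fixed.MessageLogger.params["cout"] = {"threshold": "ERROR"}
toy_accelerator_setup(fixed)
fixed_ok = "AlpakaService" in fixed.MessageLogger.params["categories"]
```

In this toy, only the second pattern keeps the object that the setup code expects, which mirrors why the patch assigns to attributes of the existing process.MessageLogger instead of replacing it.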
do you understand why the unit tests didn't catch the issue? |
This only happened for |
there are no specific tests for HI data, but I don't see how this is HI-specific (unless it's bundled with the cmssw |
I see. I only mentioned it because that was how the error was discovered, but I didn't really dig into it. It will take a bit of looking into. |
|
+1
|
|
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will be automatically merged. |
PR description:
This PR migrates the ECAL unpacker and ECAL multifit algorithm to alpaka. This is a follow-up PR to #42930, in which the data formats and conditions formats were defined.
EventFilter modules:
Amplitude and time reconstruction modules:
A customization function for the HLT menu is included that replaces the ECAL CUDA modules in the menu with alpaka ones. The customization function is not called, however, and is intended to be used in a customization function for all alpaka modules.
The alpaka modifier is used to run the alpaka modules instead of the legacy CPU modules and the CUDA modules. The matrix workflow 12434.515 runs the alpaka ECAL local reconstruction in the HLT and the offline reconstruction step.
Alternatively, for tests, the gpu modifier in the cmsDriver.py command for obtaining a reconstruction configuration can be replaced with the alpaka modifier, for example in step3 of the 12434.512 matrix workflow.
@valsdav and @Jakub-Gajownik contributed to this development as well.
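As a rough illustration of the modifier swap described above, the following sketch rewrites a cmsDriver.py command line so that gpu becomes alpaka. It assumes the modifier is passed through the --procModifiers option in its space-separated form; the helper name swap_proc_modifier and the example command are hypothetical and are not taken from the 12434.512 workflow.

```python
import shlex


def swap_proc_modifier(command, old="gpu", new="alpaka"):
    """Return `command` with `old` replaced by `new` in the --procModifiers value.

    Hypothetical helper: only the form `--procModifiers a,b,c` is handled.
    """
    args = shlex.split(command)
    for i, arg in enumerate(args):
        if arg == "--procModifiers" and i + 1 < len(args):
            modifiers = args[i + 1].split(",")
            # Replace only exact matches, leaving other modifiers untouched.
            args[i + 1] = ",".join(new if m == old else m for m in modifiers)
    return shlex.join(args)


# Hypothetical example command, for illustration only.
example = "cmsDriver.py step3 --procModifiers gpu -s RAW2DIGI,RECO"
swapped = swap_proc_modifier(example)
```

Commands that do not contain the option are returned unchanged, so the helper is safe to apply to every step of a workflow.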
PR validation:
Tested with the matrix workflow 12434.513, which produces a legacy CPU vs. alpaka(-nvidia/-serial) comparison in DQM. This test requires some modifications to keep running the legacy CPU code for the cpu branch of the SwitchProducerCUDA.
For the validation, 9k events from the /store/relval/CMSSW_13_3_0_pre4/RelValZEE_14/GEN-SIM-DIGI-RAW/133X_mcRun3_2023_realistic_v1_Standard_13_3_0_pre4-v1 sample were used.
An almost perfect agreement is found between results from the nvidia and serial backends.
Compared to the original CUDA implementation, a very good agreement is found as well.
From the TimeReport, the timing of the alpaka-nvidia version is close to the native CUDA implementation, while the alpaka-serial version is about 18% slower than the legacy CPU module.