Skip to content

ECAL unpacker and ECAL multifit algorithm migration to alpaka#43257

Merged
cmsbuild merged 1 commit intocms-sw:masterfrom
thomreis:ecal-reco-gpu-alpaka-migration-multifit-integration-140x
Jan 26, 2024
Merged

ECAL unpacker and ECAL multifit algorithm migration to alpaka#43257
cmsbuild merged 1 commit intocms-sw:masterfrom
thomreis:ecal-reco-gpu-alpaka-migration-multifit-integration-140x

Conversation

@thomreis
Copy link
Copy Markdown
Contributor

@thomreis thomreis commented Nov 13, 2023

PR description:

This PR migrates the ECAL unpacker and ECAL multifit algorithm to alpaka. This is a follow up PR to #42930 , in which the data formats and conditions formats have been defined.

EventFilter modules:

  • Unpacker
  • Portable digis to legacy digis converter
  • Electronics mapping ESProducer

Amplitude and time reconstruction modules:

  • Multifit conditions ESProducer
  • Multifit parameters ESProducer
  • Multifit uncalibrated rechits producer
  • Portable uncalibrated rechit to CPU uncalibrated rechit converter

A customization function for the HLT menu is included that replaces the ECAL CUDA modules in the menu with alpaka ones. The customization function is not called, however, and is intended to be used in a customization function for all alpaka modules.

The alpaka modifier is used to run the alpaka modules instead of the legacy CPU modules and the CUDA modules.
The matrix workflow 12434.515 runs the alpaka ECAL local reconstruction in the HLT and the offline reconstruction step.
Alternatively, for tests the gpu modifier in the cmsDriver.py command for obtaining a reconstruction configuration can be replaced with the alpaka modifier. For example in the step3 of the 12434.512 matrix workflow.

@valsdav and @Jakub-Gajownik contributed to these development as well.

PR validation:

Tested with the matrix workflow 12434.513, which produces a legacy CPU vs. alpaka(-nvidia/-serial) comparison in DQM. This test requires some modifications to keep running the legacy CPU code for the cpu branch of the SwitchProducerCUDA.

For the validation 9k events from the /store/relval/CMSSW_13_3_0_pre4/RelValZEE_14/GEN-SIM-DIGI-RAW/ 133X_mcRun3_2023_realistic_v1_Standard_13_3_0_pre4-v1 sample were used.

An almost perfect agreement is found between results from the the nvidia and serial backends.
Compared to the original CUDA implementation a very good agreement is found as well.

From the TimeReport, the timing of the alpaka-nvida version is close to the native CUDA implementation, while the alpaka-serial version is about 18% slower than the legacy CPU module.

@thomreis
Copy link
Copy Markdown
Contributor Author

type ecal

@thomreis
Copy link
Copy Markdown
Contributor Author

enable gpu

@cmsbuild cmsbuild added the ecal label Nov 13, 2023
@cmsbuild
Copy link
Copy Markdown
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43257/37641

ERROR: Build errors found during clang-tidy run.

  177 |             for (auto tx : cms::alpakatools::elements_in_block(acc, block, nchannels * nsamples)) {
      |                            ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:171:47: error: no member named 'blocks_with_stride' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  171 |           for (auto block : cms::alpakatools::blocks_with_stride(acc, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:172:47: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  172 |             for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:406:49: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  406 |               for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                               ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:422:47: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  422 |             for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:471:49: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  471 |               for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                               ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:488:47: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  488 |             for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
Suppressed 596 warnings (596 in non-user code).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@thomreis
Copy link
Copy Markdown
Contributor Author

test parameters:

@thomreis
Copy link
Copy Markdown
Contributor Author

code-checks

@cmsbuild
Copy link
Copy Markdown
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43257/37642

ERROR: Build errors found during clang-tidy run.

  177 |             for (auto tx : cms::alpakatools::elements_in_block(acc, block, nchannels * nsamples)) {
      |                            ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:171:47: error: no member named 'blocks_with_stride' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  171 |           for (auto block : cms::alpakatools::blocks_with_stride(acc, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:172:47: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  172 |             for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:406:49: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  406 |               for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                               ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:422:47: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  422 |             for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:471:49: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  471 |               for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                               ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:488:47: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  488 |             for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
Suppressed 596 warnings (596 in non-user code).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@thomreis thomreis force-pushed the ecal-reco-gpu-alpaka-migration-multifit-integration-140x branch from 5aaf97d to 14c3c41 Compare November 13, 2023 15:55
@cmsbuild
Copy link
Copy Markdown
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43257/37645

ERROR: Build errors found during clang-tidy run.

  177 |             for (auto tx : cms::alpakatools::elements_in_block(acc, block, nchannels * nsamples)) {
      |                            ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:171:47: error: no member named 'blocks_with_stride' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  171 |           for (auto block : cms::alpakatools::blocks_with_stride(acc, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:172:47: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  172 |             for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:406:49: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  406 |               for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                               ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:422:47: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  422 |             for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:471:49: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  471 |               for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                               ~~~~~~~~~~~~~~~~~~^
RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h:488:47: error: no member named 'elements_in_block' in namespace 'cms::alpakatools' [clang-diagnostic-error]
  488 |             for (auto gtx : cms::alpakatools::elements_in_block(acc, block, totalElements)) {
      |                             ~~~~~~~~~~~~~~~~~~^
Suppressed 596 warnings (596 in non-user code).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@thomreis thomreis force-pushed the ecal-reco-gpu-alpaka-migration-multifit-integration-140x branch from 14c3c41 to 8cf598a Compare November 13, 2023 16:02
@srimanob
Copy link
Copy Markdown
Contributor

+Upgrade

@AdrianoDee
Copy link
Copy Markdown
Contributor

+pdmv

@thomreis
Copy link
Copy Markdown
Contributor Author

Hi @cms-sw/dqm-l2 @cms-sw/simulation-l2 do you have any comments on this PR?

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Jan 25, 2024

urgent

  • needed for 2024 HLT production

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Jan 26, 2024

@cms-sw/dqm-l2 can you please have a look and sign ASAP?

@civanch
Copy link
Copy Markdown
Contributor

civanch commented Jan 26, 2024

+1

@rappoccio
Copy link
Copy Markdown
Contributor

+1

  • Bypassing DQM signature to get it into pre3

@rappoccio
Copy link
Copy Markdown
Contributor

merge

@nothingface0
Copy link
Copy Markdown
Contributor

nothingface0 commented Jan 26, 2024

Hello, we're getting this message while running the pixellumi DQM client with this PR on top of 14_0_0_pre2:

MessageLogger  PSet: 
AlpakaService is an unrecognized name for a PSet
MessageLogger  PSet: 
CUDAService is an unrecognized name for a PSet

Any hints on why this may happen?

Dimitris for DQM-DC

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Jan 26, 2024

Any hints on why this may happen?

what workflow are you running ?

@nothingface0
Copy link
Copy Markdown
Contributor

nothingface0 commented Jan 26, 2024

Any hints on why this may happen?

what workflow are you running ?

Not a workflow, but streamers from run 375631 (hi_run).

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Jan 26, 2024

Not a workflow, but streamers from run 375631.

something I can test ?

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Jan 26, 2024

Hello, we're getting this message while running the pixellumi DQM client with this PR on top of 14_0_0_pre2:

that's strange. The unit test for this client was run in the last batch of tests (see https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c1276f/36979/unitTests/src/DQM/Integration/test/TestDQMOnlineClient-pixellumi_dqm_sourceclient/testing.log) and no error was reported.

Can you try on top of a more recent IB?

@thomreis
Copy link
Copy Markdown
Contributor Author

Hello, we're getting this message while running the pixellumi DQM client with this PR on top of 14_0_0_pre2:

MessageLogger  PSet: 
AlpakaService is an unrecognized name for a PSet
MessageLogger  PSet: 
CUDAService is an unrecognized name for a PSet

Any hints on why this may happen?

Dimitris for DQM-DC

I had to change MessageLogger configurations in some DQM sourceclients to pass the tests but I did not need to touch the pixellumi one. Would it work if the MessageLogger configuration is adjusted in the style of the pixel_dqm_sourceclient-live_cfg.py.

@nothingface0
Copy link
Copy Markdown
Contributor

nothingface0 commented Jan 26, 2024

that's strange. The unit test for this client was run in the last batch of tests (see cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c1276f/36979/unitTests/src/DQM/Integration/test/TestDQMOnlineClient-pixellumi_dqm_sourceclient/testing.log) and no error was reported.

Can you try on top of a more recent IB?

A difference I see in the logs you posted vs. the logs on the DQM machines is:

cmssdt:

Monitoring file not found, disabling.

DQM machines:

Monitoring pipe: /tmp/dqm_monitoring/.es_monitoring_pid01066671

And directly after the line above is the Fatal Exception message.

Unsure if it's related to something.

@makortel
Copy link
Copy Markdown
Contributor

Hello, we're getting this message while running the pixellumi DQM client with this PR on top of 14_0_0_pre2:

MessageLogger  PSet: 
AlpakaService is an unrecognized name for a PSet
MessageLogger  PSet: 
CUDAService is an unrecognized name for a PSet

Any hints on why this may happen?
Dimitris for DQM-DC

I had to change MessageLogger configurations in some DQM sourceclients to pass the tests but I did not need to touch the pixellumi one. Would it work if the MessageLogger configuration is adjusted in the style of the pixel_dqm_sourceclient-live_cfg.py.

Ah, this made me realize that the ProcessAccelerator{Alpaka,CUDA}

# Propagate the AlpakaService messages through the MessageLogger
if not hasattr(process.MessageLogger, "AlpakaService"):
process.MessageLogger.AlpakaService = cms.untracked.PSet()

relies on python machinery defined in FWCore.ParameterSet.MessageLogger, that is not available if one just defined process.MessageLogger = cms.Service("MessageLogger", ...). So all uses of that pattern should be changed to modify the existing process.MessageLogger (note that process.load(...) is not needed, MessageLogger is available from cms.Process() construction)

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Jan 26, 2024

Ah, this made me realize that the ProcessAccelerator{Alpaka,CUDA} relies on python machinery defined in FWCore.ParameterSet.MessageLogger, that is not available if one just defined process.MessageLogger = cms.Service("MessageLogger", ...).

Ah... that's because HeterogeneousCore/AlpakaCore/python/ProcessAcceleratorAlpaka.py modified the configuration of the process.MessageLogger that exists when it is used -- and if we explicitly set a process.MessageLogger afterwards, we are overwriting the customised one ?

So all uses of that pattern should be changed to modify the existing process.MessageLogger (note that process.load(...) is not needed, MessageLogger is available from cms.Process() construction)

Could this be problematic for the HLT menu as it runs online ?

I guess we can just add process.MessageLogger.AlpakaService explicitly in the menu ?

@makortel
Copy link
Copy Markdown
Contributor

makortel commented Jan 26, 2024

Ah, this made me realize that the ProcessAccelerator{Alpaka,CUDA} relies on python machinery defined in FWCore.ParameterSet.MessageLogger, that is not available if one just defined process.MessageLogger = cms.Service("MessageLogger", ...).

Ah... that's because HeterogeneousCore/AlpakaCore/python/ProcessAcceleratorAlpaka.py modified the configuration of the process.MessageLogger that exists when it is used -- and if we explicitly set a process.MessageLogger afterwards, we are overwriting the customised one ?

Not really. The framework calls the ProcessAcceleratorAlpaka.apply() during the process it "serializes" the (result) of the python configuration for the C++ code, so nothing can overwrite the MessageLogger object after that (well, except other ProcessAccelerator, but they should not do that; hmm, maybe we should be explicitly prevent that).

The error message in #43257 (comment) comes from MessageLogger's ParameterSet validation code. I was a little bit off in my earlier comment that not every replacement of process.MessageLogger object along process.MessageLogger = cms.Service("MessageLogger", ...) would cause the failure.

It is only if the new MessageLogger object uses the "old configuration API" (in this case the existence of process.MessageLogger.destinations makes the configuration to be the interpreted according to the "old API"), the setting by ProcessAcceleratorAlpaka (which relies the "new configuration API") causes the failure.

(Makes me wonder if we should consider somehow preventing the "old configuration API". In general it would be complicated, or potentially break some configurations. But perhaps limiting to the cases of ProcessAcceleratorAlpaka etc we could at least achieve a clearer error message).

So all uses of that pattern should be changed to modify the existing process.MessageLogger (note that process.load(...) is not needed, MessageLogger is available from cms.Process() construction)

Could this be problematic for the HLT menu as it runs online ?

Depends how the process.MessageLogger object is first defined in the online case.

If the process.MessageLogger is "just used", or loaded via process.load('FWCore.MessageService.MessageLogger_cfi')' (or process.load('FWCore.MessageLogger.MessageLogger_cfi')', there is no problem.

If there is process.MessageLogger = cms.Service("MessageLogger", ...) or process.add_(cms.Service("MessageLogger", ...)) (or equivalent) anywhere, and the "old configuration API" is used, there will be a problem.

I guess we can just add process.MessageLogger.AlpakaService explicitly in the menu ?

I believe that would work.

@nothingface0
Copy link
Copy Markdown
Contributor

nothingface0 commented Jan 30, 2024

Thanks @thomreis, the following patch seems to work:

diff --git a/DQM/Integration/python/clients/pixellumi_dqm_sourceclient-live_cfg.py b/DQM/Integration/python/clients/pixellumi_dqm_sourceclient-live_cfg.py
index 2e62f7a11c7..f76d76317cb 100644
--- a/DQM/Integration/python/clients/pixellumi_dqm_sourceclient-live_cfg.py
+++ b/DQM/Integration/python/clients/pixellumi_dqm_sourceclient-live_cfg.py
@@ -13,12 +13,10 @@ unitTest=False
 if 'unitTest=True' in sys.argv:
     unitTest=True
 
-process.MessageLogger = cms.Service("MessageLogger",
-    debugModules = cms.untracked.vstring('siPixelDigis', 
-                                        'sipixelEDAClient'),
-    cout = cms.untracked.PSet(threshold = cms.untracked.string('ERROR')),
-    destinations = cms.untracked.vstring('cout')
-)
+process.MessageLogger.debugModules = cms.untracked.vstring('siPixelDigis',
+                                                           'sipixelEDAClient')
+process.MessageLogger.cout = cms.untracked.PSet(threshold = cms.untracked.string('ERROR'))
+
 
 #----------------------------
 # Event Source

We will take a look at the other clients as well and make a PR for this.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Jan 30, 2024

@nothingface0

We will take a look at the other clients as well and make a PR for this.

do you understand why the unit tests didn't catch the issue?

@nothingface0
Copy link
Copy Markdown
Contributor

nothingface0 commented Jan 30, 2024

@mmusich

do you understand why the unit tests didn't catch the issue?

This only happened for hi_run data. I'm not sure if there are unit tests for HI data?

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Jan 30, 2024

This only happened for hi_run data. I'm not sure if there are unit tests for HI data?

there are no specific tests for HI data, but I don't see how this is HI-specific (unless it's bundled with the cmssw era !?)

@nothingface0
Copy link
Copy Markdown
Contributor

nothingface0 commented Jan 30, 2024

I don't see how this is HI-specific.

I see. I only mentioned it because that was how the error was discovered, but I didn't really dig into it. It will take a bit of looking into.

@tjavaid
Copy link
Copy Markdown

tjavaid commented Jun 27, 2024

+1

  • To remove from dqm queue

@cmsbuild
Copy link
Copy Markdown
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will be automatically merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.