Skip to content

Digi Morphing for HLT#48734

Merged
cmsbuild merged 2 commits intocms-sw:masterfrom
CMSTrackerDPG:alternate_implementation_v2
Sep 5, 2025
Merged

Digi Morphing for HLT#48734
cmsbuild merged 2 commits intocms-sw:masterfrom
CMSTrackerDPG:alternate_implementation_v2

Conversation

@Chirayu18
Copy link
Copy Markdown
Contributor

@Chirayu18 Chirayu18 commented Aug 14, 2025

PR description:

Morphing Algorithm for Fake Digi Addition

This PR introduces an alternative alpaka-based algorithm for pixel digi morphing. This implementation is equivalent in principle to the legacy implementation and to the other alpaka-based implementation in PR #48343 but supercedes the latter as it has issues with throughput due to memory usage. Like the other implementations, digi morphing is applied here only to specific detector regions which can be configured as shown below.

Configuration Options

The morphing behavior can be controlled in the EDProducer configuration with the following options:

  • Enable Morphing:
    DoDigiMorphing = cms.bool(True)  # Default: False
  • Maximum Fakes per Module:
    MaxFakesInModule = cms.uint32(10000)  # Default: maxPixInModule * 2 / 5

Regional Morphing

Regions can be specified in the configuration as follows:

  • Barrel Regions:
    Use LAYER,LADDER,MODULE coordinates, with support for individual values or ranges (e.g., 1-12 for ladders 1 through 12).

    barrelRegions = cms.vstring(
        "1,1-12,1-2",
        "1,1-12,7-8",
        "2,1-28,1",
        "2,1-28,8"
    )  # Default: as shown above
  • Endcap Regions:
    Use DISK,BLADE,SIDE,PANEL coordinates, with support for ranges.

    endcapRegions = cms.vstring(
        # Default: empty (no endcap regions selected)
    )

PR validation:

Validation results can be found here: https://indico.cern.ch/event/1567986/contributions/6647897/attachments/3116797/5527819/HLT_digi_morphing_validation.pdf

@cmsbuild cmsbuild added this to the CMSSW_15_1_X milestone Aug 14, 2025
@cmsbuild
Copy link
Copy Markdown
Contributor

This PR contains too many commits (644 >= 240) and will not be processed.
Please ensure you have selected the correct target branch and consider squashing unnecessary commits.
The processing of this PR will resume once the commit count drops below the limit.

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Aug 14, 2025

cms-bot internal usage

@cmsbuild
Copy link
Copy Markdown
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-48734/45791

ERROR: Build errors found during clang-tidy run.

src/RecoLocalTracker/SiPixelClusterizer/plugins/alpaka/SiPixelRawToCluster.cc:376:43: error: too few arguments to function call, expected 12, have 11 [clang-diagnostic-error]
  366 |     Algo_.makePhase1ClustersAsync(iEvent.queue(),
      |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  367 |                                   clusterThresholds_,
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

Comment thread .gitignore
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove these changes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

ALPAKA_FN_ACC
bool isMorphingModule(uint32_t moduleId, const uint32_t* morphingModules, uint32_t nMorphingModules) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are more than about 10 modules, it would be much more efficient to

  • sort the modules (once, on the cpu, in the constructor of the module)
  • use a binary search to check if a module is in the list

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor

@fwyzard fwyzard Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Could you make the implementation generic and move it to a central place, like HeterogeneousCore/AlpakaInterface/interface/alpakastdAlgorithm.h ?
You may be able to reuse some of the functionality from that file.

See also HeterogeneousCore/CUDAUtilities/interface/cudastdAlgorithm.h for the old CUDA-only version.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been addressed?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but I guess it can be done later.

Comment on lines +175 to +176
// FIXME: this is just an estimate, to be studied and optimised
//static constexpr uint32_t maxFakesInModule = TrackerTraits::maxPixInModule * 2 / 5;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

uint32_t thisModuleId = digi_view[firstPixel].moduleId();
uint32_t rawModuleId = digi_view[firstPixel].rawIdArr();
applyDigiMorphing =
applyDigiMorphing && pixelStatus::isMorphingModule(rawModuleId, morphingModules, nMorphingModules);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong: a kernel block may loop over different modules (line 196); once it encounters a module that is not in the list, it will change the applyDigiMorphing to false; at that point all other modules will be ignored.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ALPAKA_FN_ACC void operator()(Acc1D const& acc,
SiPixelDigisSoAView digi_view,
SiPixelDigisSoAView fakes_view,
bool applyDigiMorphing,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool applyDigiMorphing,
bool enableDigiMorphing,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +202 to +203
applyDigiMorphing =
applyDigiMorphing && pixelStatus::isMorphingModule(rawModuleId, morphingModules, nMorphingModules);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
applyDigiMorphing =
applyDigiMorphing && pixelStatus::isMorphingModule(rawModuleId, morphingModules, nMorphingModules);
bool applyDigiMorphing =
enableDigiMorphing && pixelStatus::isMorphingModule(rawModuleId, morphingModules, nMorphingModules);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

uint32_t rawModuleId = digi_view[firstPixel].rawIdArr();
applyDigiMorphing =
applyDigiMorphing && pixelStatus::isMorphingModule(rawModuleId, morphingModules, nMorphingModules);
//applyDigiMorphing = applyDigiMorphing && isMorphingModule;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

std::vector<std::string> limits;
boost::split(limits, r, boost::is_any_of("-"));
try {
if (limits.size() > 1) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a check that there are exactly 0 or 1 hyphens, and not more than 1 ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@cmsbuild
Copy link
Copy Markdown
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-48734/45792

Code check has found code style and quality issues which could be resolved by applying following patch(s)

Comment on lines +583 to +587
auto morphingModules_d =
cms::alpakatools::make_device_buffer<uint32_t[]>(queue, digiMorphingConfig.morphingModules.size());
auto morphingModules_h = cms::alpakatools::make_host_view(digiMorphingConfig.morphingModules.data(),
digiMorphingConfig.morphingModules.size());
alpaka::memcpy(queue, morphingModules_d, morphingModules_h);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to stay the same for the whole job.
Can you move the allocation of the buffer and the copy outside of the event loop ?
One option would be to use the cms::alpakatools::MoveToDeviceCache functionality (see the README.md).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I couln't manage to use MoveToDevice functionality as the struct I used in place of TestAlgo::UpdateInfo ( as done in the example) contains std::vector which is not trivially destructable and there are compilation issues with that.

Instead I moved this part to where the cabling map changes so that it's not repeated for every event.
Please let me know if this is fine or there is a better way

Comment on lines +699 to +707
#if 0
alpaka::exec<Acc1D>(queue,
workDivMaxNumModules,
FindClus<TrackerTraits>{},
digis_view,
unused.view(),
clusters_d->view(),
numDigis);
#endif
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is clearly not OK 🤷🏻

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +259 to +277
tTopo_ = &iSetup.getData(trackerTopologyToken_);
digiMorphingConfig_.morphingModules.clear();
for (const auto& connection : cablingMap_->det2fedMap()) {
auto rawId = connection.first;
if (rawId == 0)
continue;
DetId detId(rawId);
if (!skipDetId(tTopo_, detId, theBarrelRegions_, theEndcapRegions_)) {
digiMorphingConfig_.morphingModules.push_back(rawId);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sorry, it is already done only if there are changes, ignore my previous comment)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the tracker topology is in a different record (TrackerTopologyRcd) than the cabling map (SiPixelFedCablingMapRcd), so the check needs to be split ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I am not sure what you mean by the check needs to be split... Could you please clarify?

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Sep 1, 2025

Pull request #48734 was updated. @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen can you please check and sign again.

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Sep 1, 2025

please test

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Sep 2, 2025

tests here seem stuck. Are we expecting them to finish in a finite amount of time?

@iarspider
Copy link
Copy Markdown
Contributor

@mmusich I have restarted the unit test

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Sep 2, 2025

+1

Size: This PR adds an extra 24KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-da6630/47935/summary.html
COMMIT: 386b44c
CMSSW: CMSSW_15_1_X_2025-09-01-1100/el8_amd64_gcc12
Additional Tests: GPU,AMD_MI300X,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/48734/47935/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

AMD_MI300X Comparison Summary

Summary:

NVIDIA_T4 Comparison Summary

Summary:

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Sep 4, 2025

@cms-sw/heterogeneous-l2 @cms-sw/reconstruction-l2 do you have any other comments and / or requests for this?

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Sep 4, 2025

I will not be able to re-check the code in the near future, so let's assume that all comments have been addressed 🤷🏻‍♂️

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Sep 4, 2025

+heterogeneous

@jfernan2
Copy link
Copy Markdown
Contributor

jfernan2 commented Sep 4, 2025

+1

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Sep 4, 2025

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Copy link
Copy Markdown
Contributor

+1
Looking thru the review, it seems that are still some improvements that can be made to the implementation, but that they can be followed up in subsequent pull requests. Let's integrate this now to facilitate further testing before this is actually deployed in the HLT menu. As is, this PR has no effect on production workflows.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Sep 9, 2025

I suspect this PR is causing failures in the GPU relvals since when it was merged in CMSSW_15_1_X_2025-09-05-2300.
See e.g. this log.

Also FWIW, running the addOnTests.py -t hlt_mc_HIon,hlt_data_HIon on lxplus-gpu in CMSSW_15_1_0_pre6 (without this PR), runs fine, whereas in the latest IB (such as CMSSW_15_1_X_2025-09-08-2300) they fail with e.g.:

----- Begin Fatal Exception 09-Sep-2025 09:49:11 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 362321 lumi: 231 event: 111754586 stream: 2
   [1] Running path 'HLT_HIUPC_MinPixelThrust0p8_MaxPixelCluster10000_v4'
   [2] Calling method for module SiPixelRawToClusterHIonPhase1@alpaka/'hltSiPixelClustersPPOnAASoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/el9_amd64_gcc12/external/alpaka/1.2.0-ca398df5d9e013dd3373eebb7be69807/include/alpaka/event/EventUniformCudaHipRt.hpp(143) 'ret = TApi::eventQuery(event.getN\
ativeHandle())' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
----- End Fatal Exception -------------------------------------------------

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Sep 9, 2025

Also:

You potentially added 2131218 lines to the logs

doesn't look very sane.

trackerTopologyToken_(esConsumes<TrackerTopology, TrackerTopologyRcd>()),
includeErrors_(iConfig.getParameter<bool>("IncludeErrors")),
useQuality_(iConfig.getParameter<bool>("UseQualityInfo")),
verbose_(iConfig.getParameter<bool>("verbose")),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is undoing #48494, why?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #48890.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot.

@ferencek
Copy link
Copy Markdown
Contributor

Also:

You potentially added 2131218 lines to the logs

doesn't look very sane.

This is because this PR has unintentionally undone #48494. To be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants