
LST: add LSTGeometry package and associated ESProducer#50679

Open
ariostas wants to merge 4 commits into cms-sw:master from SegmentLinking:ariostas/lst_geometry

Conversation

@ariostas
Contributor

ariostas commented Apr 7, 2026

This PR adds a new RecoTracker/LSTGeometry package containing the module map computation used by the LST algorithm. Currently, the maps are pre-computed by the code in https://github.com/SegmentLinking/LSTGeometry and they are stored in https://github.com/cms-data/RecoTracker-LSTCore. This PR allows for the on-the-fly computation of these maps via an ESProducer, ensuring that they stay consistent with the tracker geometry being used.
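
For orientation, here is a minimal sketch of the ESProducer pattern involved. The class and product names below are hypothetical placeholders, and the actual code in RecoTracker/LSTGeometry may differ; only the standard CMSSW types and records are real.

// Hypothetical sketch of an ESProducer computing the maps from the tracker geometry;
// the real LSTGeometryESProducer in RecoTracker/LSTGeometry may differ in names and inputs.
#include <memory>

#include "FWCore/Framework/interface/ESProducer.h"
#include "FWCore/Framework/interface/ModuleFactory.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/ESGetToken.h"
#include "Geometry/Records/interface/TrackerDigiGeometryRecord.h"
#include "Geometry/TrackerGeometryBuilder/interface/TrackerGeometry.h"
#include "RecoTracker/Record/interface/TrackerRecoGeometryRecord.h"

// Placeholder for the module-map product actually defined by this PR.
struct HypotheticalLSTGeometry {};

class HypotheticalLSTGeometryESProducer : public edm::ESProducer {
public:
  HypotheticalLSTGeometryESProducer(edm::ParameterSet const& iConfig) {
    auto cc = setWhatProduced(this);
    geomToken_ = cc.consumes();  // consume the tracker geometry from the EventSetup
  }

  std::unique_ptr<HypotheticalLSTGeometry> produce(TrackerRecoGeometryRecord const& iRecord) {
    TrackerGeometry const& trackerGeom = iRecord.get(geomToken_);
    // ... compute the module maps from trackerGeom, on the host ...
    (void)trackerGeom;
    return std::make_unique<HypotheticalLSTGeometry>();
  }

private:
  edm::ESGetToken<TrackerGeometry, TrackerDigiGeometryRecord> geomToken_;
};

DEFINE_FWK_EVENTSETUP_MODULE(HypotheticalLSTGeometryESProducer);

In this pattern the product is recomputed whenever the tracker geometry changes, which is what keeps the maps consistent with the geometry in use.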

This is the last major task in #46746.

c.c. @slava77

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50679/48907

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

A new Pull Request was created by @ariostas for master.

It involves the following packages:

  • HLTrigger/Configuration (hlt)
  • RecoTracker/IterativeTracking (reconstruction)
  • RecoTracker/LST (reconstruction)
  • RecoTracker/LSTCore (reconstruction)
  • RecoTracker/LSTGeometry (****)

The following packages do not have a category, yet:

RecoTracker/LSTGeometry
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@Martin-Grunewald, @Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @dgulhan, @elusian, @felicepantaleo, @gpetruc, @missirol, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Contributor

mmusich commented Apr 7, 2026

test parameters:

  • enable = hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt

@mmusich
Contributor

mmusich commented Apr 7, 2026

@cmsbuild, please test

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

-1

Failed Tests: UnitTests HLTP2Timing
Size: This PR adds an extra 104KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7657dc/52513/summary.html
COMMIT: e612f24
CMSSW: CMSSW_17_0_X_2026-04-07-1100/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50679/52513/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS

Comparison Summary

Summary:

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 17 workflow step(s) with memory usage exceeding the error threshold:

Expand to see workflows ...
  • Error: Workflow 34434.0_TTbar_14TeV+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.75_TTbar_14TeV+Run4D121_HLT75e33Timing step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7501_TTbar_14TeV+Run4D121_HLT75e33TrackingOnly step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7502_TTbar_14TeV+Run4D121_HLT75e33TrackingNtuple step2 max memory diff 191.9 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.751_TTbar_14TeV+Run4D121_HLT75e33TimingAlpaka step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.752_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5 step2 max memory diff 189.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7521_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5TrackLinkGNN step2 max memory diff 166.0 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.755_TTbar_14TeV+Run4D121_HLT75e33TimingLST step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.756_TTbar_14TeV+Run4D121_HLT75e33TimingTrimmedTracking step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.757_TTbar_14TeV+Run4D121_HLT75e33TimingMkFitFit step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.758_TTbar_14TeV+Run4D121_HLT75e33TimingTiclBarrel step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.759_TTbar_14TeV+Run4D121_HLTPhase2WithNano step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.775_TTbar_14TeV+Run4D121_NGTScoutingCAExtensionMergeT5 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.911_TTbar_14TeV+Run4D121_DD4hep step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34496.0_CloseByPGun_CE_E_Front_120um+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34500.0_CloseByPGun_CE_H_Coarse_Scint+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34634.999_TTbar_14TeV+Run4D121PU_PMXS1S2PR step3 max memory diff 191.8 exceeds +/- 90.0 MiB

@makortel
Contributor

makortel commented Apr 7, 2026

Is ~190 MB increase in memory usage expected?

(Outdated review thread on RecoTracker/LSTGeometry/test/dumpLSTGeometry.py)
@ariostas
Contributor Author

ariostas commented Apr 7, 2026

Is ~190 MB increase in memory usage expected?

That seems a bit high, but it's plausible. I'll double-check. Either way, it is only temporary: most of it is freed once the maps are constructed.

@makortel
Contributor

makortel commented Apr 7, 2026

Is ~190 MB increase in memory usage expected?

That seems a bit high, but it's plausible. I'll double-check. Either way, it is only temporary: most of it is freed once the maps are constructed.

According to the monitoring the peak memory usage would increase by ~190 MB, and thus freeing it afterwards doesn't help much if the job was killed because of going over the limit.

@makortel
Contributor

makortel commented Apr 7, 2026

test parameters:

  • workflows_profiling = 34434.0
  • enable_tests = profiling

@makortel
Contributor

makortel commented Apr 7, 2026

@cmsbuild, please test

Maybe one round of profiling tests would be worth it.

@mmusich
Contributor

mmusich commented Apr 21, 2026

@cmsbuild, please test with #50479

@ariostas
Contributor Author

please clarify if you were able to reproduce a crash with "An exception of category 'AsyncCallNotAllowed' occurred while ..." as in the fractionally available logs from the HLTP2Timing tests on a machine with a T4 (e.g. lxplus-gpu).

I did, back when I hadn't tightened the module maps, and the underlying reason was that it was running out of memory. I assume the same thing is still happening. I'm looking into it to see what else could be contributing to higher VRAM usage.

@slava77
Contributor

slava77 commented Apr 21, 2026

I did, back when I hadn't tightened the module maps, and the underlying reason was that it was running out of memory. I assume the same thing is still happening. I'm looking into it to see what else could be contributing to higher VRAM usage.

OK. I thought/misunderstood that it went away on that test machine with the latest updates.

@mmusich
Contributor

mmusich commented Apr 21, 2026

the fractionally available logs from the HLTP2Timing

now you have a link to the full logs:

21-Apr-2026 16:32:44 CEST  Initiating request to open file file:/data/user/cmsbuild//store/relval/CMSSW_15_1_0_pre3/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_150X_mcRun4_realistic_v1_STD_Run4D110_PU-v1/2590000/00c675dc-1517-4af7-8dd4-841e0668fefe.root
21-Apr-2026 16:32:58 CEST  Successfully opened file file:/data/user/cmsbuild//store/relval/CMSSW_15_1_0_pre3/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_150X_mcRun4_realistic_v1_STD_Run4D110_PU-v1/2590000/00c675dc-1517-4af7-8dd4-841e0668fefe.root
Begin processing the 1st record. Run 1, Event 7301, LumiSection 74 on stream 12 at 21-Apr-2026 16:33:19.818 CEST
Begin processing the 2nd record. Run 1, Event 7302, LumiSection 74 on stream 7 at 21-Apr-2026 16:33:19.819 CEST
Begin processing the 3rd record. Run 1, Event 7303, LumiSection 74 on stream 4 at 21-Apr-2026 16:33:19.819 CEST
Begin processing the 4th record. Run 1, Event 7304, LumiSection 74 on stream 8 at 21-Apr-2026 16:33:19.819 CEST
Begin processing the 5th record. Run 1, Event 7305, LumiSection 74 on stream 14 at 21-Apr-2026 16:33:19.820 CEST
Begin processing the 6th record. Run 1, Event 7306, LumiSection 74 on stream 13 at 21-Apr-2026 16:33:19.820 CEST
Begin processing the 7th record. Run 1, Event 7307, LumiSection 74 on stream 2 at 21-Apr-2026 16:33:19.820 CEST
Begin processing the 8th record. Run 1, Event 7308, LumiSection 74 on stream 9 at 21-Apr-2026 16:33:19.820 CEST
Begin processing the 9th record. Run 1, Event 7309, LumiSection 74 on stream 1 at 21-Apr-2026 16:33:19.821 CEST
Begin processing the 10th record. Run 1, Event 7310, LumiSection 74 on stream 15 at 21-Apr-2026 16:33:19.821 CEST
Begin processing the 11th record. Run 1, Event 7311, LumiSection 74 on stream 10 at 21-Apr-2026 16:33:19.821 CEST
Begin processing the 12th record. Run 1, Event 7312, LumiSection 74 on stream 11 at 21-Apr-2026 16:33:19.821 CEST
Begin processing the 13th record. Run 1, Event 7313, LumiSection 74 on stream 5 at 21-Apr-2026 16:33:19.822 CEST
Begin processing the 14th record. Run 1, Event 7314, LumiSection 74 on stream 0 at 21-Apr-2026 16:33:19.822 CEST
Begin processing the 15th record. Run 1, Event 7315, LumiSection 74 on stream 6 at 21-Apr-2026 16:33:19.822 CEST
Begin processing the 16th record. Run 1, Event 7316, LumiSection 74 on stream 3 at 21-Apr-2026 16:33:19.822 CEST
Begin processing the 17th record. Run 1, Event 7317, LumiSection 74 on stream 7 at 21-Apr-2026 16:33:20.236 CEST
Begin processing the 18th record. Run 1, Event 7318, LumiSection 74 on stream 15 at 21-Apr-2026 16:33:22.029 CEST
Begin processing the 19th record. Run 1, Event 7319, LumiSection 74 on stream 12 at 21-Apr-2026 16:33:24.905 CEST
----- Begin Fatal Exception 21-Apr-2026 16:33:29 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7318 stream: 15
   [1] Running path 'HLT_DoublePFPuppiJets128_DoublePFPuppiBTagDeepFlavour_2p4'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 21-Apr-2026 16:33:29 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7304 stream: 8
   [1] Running path 'HLT_PFPuppiHT1070'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 21-Apr-2026 16:33:29 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7311 stream: 10
   [1] Running path 'HLT_DoublePFPuppiJets128_DoublePFPuppiBTagDeepFlavour_2p4'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 21-Apr-2026 16:33:29 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7315 stream: 6
   [1] Running path 'HLT_PFPuppiHT1070'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------

@cmsbuild
Contributor

-1

Failed Tests: HLTP2Timing
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7657dc/52795/summary.html
COMMIT: 4e3de58
CMSSW: CMSSW_17_0_X_2026-04-20-2300/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50679/52795/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 3 lines to the logs
  • Reco comparison results: 7 differences found in the comparisons
  • DQMHistoTests: Total files compared: 68
  • DQMHistoTests: Total histograms compared: 4803106
  • DQMHistoTests: Total failures: 4340
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4798746
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 67 files compared)
  • Checked 282 log files, 243 edm output root files, 68 DQM output files
  • TriggerResults: no differences found

@slava77
Contributor

slava77 commented Apr 23, 2026

Failed Tests: HLTP2Timing

I was looking at this PR's timing log, compared to the log of a "reference" run with #50479.

@mmusich

  • do I understand correctly that the memory is polled at a 1 s interval and that it continues in a separate process from cmsRun across crashes? I see a sequence of
    • + now=1776782011 + elapsed=78 + max_mem=29250 at 16:33:31, aligned with the exception message at 16:33:29
    • + now=1776782149 + elapsed=216 + max_mem=29380 at 16:35:49, with a fatal exception at 16:35:48
    • + now=1776782316 + elapsed=383 + max_mem=29662 at 16:38:36, with a fatal exception at 16:38:36
    • Notably, in this same Phase2_L1P2GT_HLT the reference has a max of 29784 MiB, above the max at the time of the crashes
  • Is the memory profile vs time available somewhere? I wanted to see when the reference reaches the max.
  • BTW, wasn't the timing supposed to be restricted to one T4? (The max memory is close to 30 GB, twice the size of a T4.)

@mmusich
Contributor

mmusich commented Apr 23, 2026

@slava77

do I understand correctly that the memory is polled at a 1 s interval and that it continues in a separate process from cmsRun across crashes?

yes.

Is the mem profile vs time available somewhere? I wanted to see when the reference reaches the max

not yet, see #50479 (comment).

BTW, wasn't the timing supposed to be restricted to one T4? (the max mem is close to 30 GB, twice the size of a T4.)

no, we're using both GPUs. Only one CPU socket (out of two) is used in order to have a 50/50 compute split.

@ariostas
Contributor Author

I've been doing some debugging, and I'm puzzled by what I've been finding.

I found that to reliably and clearly reproduce the issue, it's better to restrict to a single GPU, use 1 job, and 16 threads/streams. I'm using the runHLTTiming script, but only running run_phase2_gpu (and setting -j 1 -t 16 -s 16).

I made a new branch that adds this extra commit SegmentLinking@a9ab182. The commit just switches back to loading the maps from the binary files instead of using the ES product, while leaving all the setup in place. With this setup, the VRAM usage still increases a lot, even though the ESProducer is CPU-only and the product is not used at all.

However, by simply commenting out this line the issue is resolved. Here is a plot comparing VRAM usage with and without that line.

[image: plot comparing VRAM usage with and without that line]

So it seems that just having the ESProducer run causes VRAM usage to increase, even though it is constructed purely on the host and the product is not being used. I find this very confusing, so I was wondering if you have any suggestions.

I should mention that if I dial it back to 1 thread/stream, then everything looks identical in both cases. Also, I have tried to profile it with nsys, but it gets stuck when I try to use more than 1 stream.

@makortel
Contributor

So it seems that just having the ESProducer run causes VRAM usage to increase, even though it is purely constructed on the host, and the product is not being used. I find this very confusing, so I was wondering if you have any suggestions.

The situation almost smells like (or, that would be the easiest explanation I could quickly think of) some other component consuming an ES data product on the device, which would trigger the production, but in a way that the component does not fail if the data product is missing. Is e.g. EventSetupRecordDataGetter used in any way (from a quick git grep I'd guess "no", but maybe better to ask / check explicitly)?

Does the behavior of excessive memory usage reproduce on 1 thread/stream? Does the behavior reproduce if processing only a few (down to 1) events?

If the answers are "yes", I'd suggest to add the Tracer service

process.add_(cms.Service("Tracer", dumpPathsAndConsumes=cms.untracked.bool(True)))

and put the (large) log somewhere accessible. This service prints every framework transition for every module, and when configured like this also the ED and ES data product consumption information.

@ariostas
Contributor Author

Does the behavior of excessive memory usage reproduce on 1 thread/stream?

No, for 1 thread/stream everything looks normal.

Does the behavior reproduce if processing only few (down to 1) events?

Yeah, it still happens with only a few events.

I'd suggest to add the Tracer service

Here's the log with the tracer: part1, part2.

Nothing seems obviously wrong. LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' is marked as consuming LSTGeometryESProducer/'hltLSTGeometry', but as I mentioned, it is not actually used because it is commented out. I don't see any obvious duplication of products or anything like that.

@Dr15Jones
Contributor

Nothing seems obviously wrong. LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' is marked as consuming LSTGeometryESProducer/'hltLSTGeometry', but as I mentioned, it is not actually used because it is commented out. I don't see any obvious duplication of products or anything like that.

commenting out the request in produce is not enough. Saying you consume the item will cause the framework to prefetch it. So to actually keep the module from being called requires that no module says it consumes it.
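
As a hedged illustration (hypothetical names, not the actual LSTModulesDevESProducer code), the token registration in the constructor is what makes the framework prefetch, and therefore run, the upstream producer, independently of whether produce() ever reads the product:

// Hypothetical sketch: declaring the consumption is enough to schedule the
// producer of the consumed product; commenting out only the get() in produce()
// does not prevent that.
#include <memory>

#include "FWCore/Framework/interface/ESProducer.h"
#include "FWCore/Framework/interface/ModuleFactory.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/ESGetToken.h"
#include "RecoTracker/Record/interface/TrackerRecoGeometryRecord.h"

struct HypotheticalLSTGeometry {};     // stand-in for the hltLSTGeometry product
struct HypotheticalModulesProduct {};  // stand-in for the consumer's own product

class HypotheticalConsumerESProducer : public edm::ESProducer {
public:
  HypotheticalConsumerESProducer(edm::ParameterSet const&) {
    auto cc = setWhatProduced(this);
    geometryToken_ = cc.consumes();  // this declaration alone triggers the prefetch
  }

  std::unique_ptr<HypotheticalModulesProduct> produce(TrackerRecoGeometryRecord const& iRecord) {
    // auto const& geom = iRecord.get(geometryToken_);  // commented out: the product is never
    //                                                  // read here, yet it is still produced
    return std::make_unique<HypotheticalModulesProduct>();
  }

private:
  edm::ESGetToken<HypotheticalLSTGeometry, TrackerRecoGeometryRecord> geometryToken_;
};

DEFINE_FWK_EVENTSETUP_MODULE(HypotheticalConsumerESProducer);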

@ariostas
Contributor Author

ariostas commented Apr 29, 2026

commenting out the request in produce is not enough. Saying you consume the item will cause the framework to prefetch it. So to actually keep the module from being called requires that no module says it consumes it.

Well, if I just comment out the consume, it's back to normal. The point is that somehow the module being called causes VRAM usage to increase, even though it's a CPU module and the product is never used, so it should have no effect on VRAM usage.

@makortel
Contributor

Nothing seems obviously wrong. LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' is marked as consuming LSTGeometryESProducer/'hltLSTGeometry', but as I mentioned, it is not actually used because it is commented out. I don't see any obvious duplication of products or anything like that.

commenting out the request in produce is not enough. Saying you consume the item will cause the framework to prefetch it. So to actually keep the module from being called requires that no module says it consumes it.

Right. This behavior is visible in the Tracer log as well:

++++++++++++ starting: processing esmodule: label = 'hltLSTGeometry' type = LSTGeometryESProducer in record = TrackerRecoGeometryRecord
<cut>
++++++++++++ finished: processing esmodule: label = 'hltLSTGeometry' type = LSTGeometryESProducer in record = TrackerRecoGeometryRecord
++++++++++ finished: prefetching for esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord
++++++++++ starting: processing esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord
++++++++++ finished: processing esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord

So when you

simply commenting out this line

the hltLSTGeometry can't be run, and this does not result in an error because the only consumer does not actually access the data, given that these lines are commented out:
https://github.com/SegmentLinking/cmssw/blob/a9ab18292aa3f5a4b0774aecec84d628f17a544a/RecoTracker/LST/plugins/alpaka/LSTModulesDevESProducer.cc#L40-L42

This analysis does not answer the question of how LSTGeometryESProducer leads to GPU memory being used.

@makortel
Contributor

makortel commented Apr 29, 2026

This analysis does not answer the question of how LSTGeometryESProducer leads to GPU memory being used.

The Tracer log shows only LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' consuming the data product of hltLSTGeometry (and from the code only the host data product is consumed). The log also shows that only one produce call is made on hltLSTGeometry (i.e. no sign of implicit host-to-device copy; well, there can't be because LSTGeometryESProducer is not an Alpaka module).

@makortel
Contributor

Does the behavior of excessive memory usage reproduce on 1 thread/stream?

No, for 1 thread/stream everything looks normal.

If 1 thread/stream shows "good behavior", I'm wondering if the caching allocator could play a role. The allocator is shared, and if some modules concurrently allocate large temporary buffers, those buffers might end up being held by the caching allocator without being used later in the job. On 1 thread these temporary buffers would be allocated and deallocated serially, and the same large buffer could be reused by multiple modules.

But this is, of course, pure speculation, and does not explain the role of the existence of hltLSTGeometry in the GPU memory usage.
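
To make the serial-vs-concurrent intuition concrete, here is a toy sketch; this is not the actual cms::alpakatools::CachingAllocator, and the buffer size and stream count are purely illustrative:

// Toy model: a shared cache that keeps released buffers for reuse. One stream
// reuses a single cached buffer serially; N concurrent streams each hold their
// own temporary buffer, so the peak device allocation scales with N.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct ToyCache {
  std::size_t live = 0, peak = 0;
  std::vector<std::size_t> freeList;  // cached (released but not freed) buffers
  std::size_t allocate(std::size_t bytes) {
    if (!freeList.empty()) {  // reuse a cached buffer (all buffers same size here)
      freeList.pop_back();
      return bytes;
    }
    live += bytes;  // would be a real device allocation
    peak = std::max(peak, live);
    return bytes;
  }
  void release(std::size_t bytes) { freeList.push_back(bytes); }  // cached, not freed
};

int main() {
  constexpr std::size_t bufBytes = 100ULL << 20;  // ~100 MB temporary buffer (illustrative)

  ToyCache serial;  // 1 stream: allocate and release one event at a time
  for (int ev = 0; ev < 16; ++ev) serial.release(serial.allocate(bufBytes));

  ToyCache concurrent;  // 16 streams: all temporaries live at the same time
  for (int s = 0; s < 16; ++s) concurrent.allocate(bufBytes);

  std::cout << "serial peak:     " << (serial.peak >> 20) << " MB\n"
            << "concurrent peak: " << (concurrent.peak >> 20) << " MB\n";
}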

@makortel
Contributor

The CachingAllocator hypothesis could be investigated further by comparing the behavior between 1-thread and many-thread cases (on a few events).

The debug prints of the CachingAllocator can be enabled with

if not hasattr(process, "AlpakaServiceCudaAsync"):
    process.load("HeterogeneousCore.AlpakaServices.AlpakaServiceCudaAsync_cfi")
    process.AlpakaServiceCudaAsync.verbose = True

A crude way to see the functions that lead to actual memory allocations would be

cmsTraceFunction "cms::alpakatools::CachingAllocator<alpaka::DevCudaRt, alpaka::QueueCudaRtNonBlocking>::allocateBuffer" cmsRun ...

(I'm not 100 % sure I got the CachingAllocator template instantiation right; possibly tracing calls to just cudaMalloc might also do the trick)

@ariostas
Contributor Author

The CachingAllocator hypothesis could be investigated further...

I'm currently recompiling everything after adding

<flags CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"/>

to all the LST build files. I'll see what happens and try using the debug prints. Thanks!

@ariostas
Contributor Author

Okay, so disabling the caching allocator shows that there's a big spike. So it's not actually the caching allocator itself; rather, something is briefly allocating a big chunk of memory.

[image: VRAM usage plot showing the short-lived spike with the caching allocator disabled]

I'll try tracing CachingAllocator/cudaMalloc calls to see if I can pinpoint what's happening.

@ariostas
Contributor Author

ariostas commented May 1, 2026

I couldn't get cmsTraceFunction to work. Not sure why, but it didn't trace any function that I tried.

Using gdb directly, I looked at calls to cudaMalloc, and I see that there is no single giant allocation; rather, there are many more allocations than before (more than double).

[image: comparison of cudaMalloc allocation counts]

I'm going to be on vacation for the next 2 weeks. But I'll keep looking into this when I get back.

@slava77
Contributor

slava77 commented May 1, 2026

Can it be something to do with the number of queues (and, subsequently, some extra allocations coming per queue)?
How is the number of queues defined? Can it vary randomly (presumably repeatable, but varying from unrelated changes)?

@mmusich
Contributor

mmusich commented May 7, 2026

In addition to the problems already discussed, this branch now has conflicts that must be resolved. @ariostas
