
LST: add LSTGeometry package and associated ESProducer#50679

Open
ariostas wants to merge 4 commits into cms-sw:master from SegmentLinking:ariostas/lst_geometry

Conversation

@ariostas
Contributor

ariostas commented Apr 7, 2026

This PR adds a new RecoTracker/LSTGeometry package containing the module map computation used by the LST algorithm. Currently, the maps are pre-computed by the code in https://github.com/SegmentLinking/LSTGeometry and they are stored in https://github.com/cms-data/RecoTracker-LSTCore. This PR allows for the on-the-fly computation of these maps via an ESProducer, ensuring that they stay consistent with the tracker geometry being used.
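
For orientation, here is a minimal sketch of the ESProducer pattern involved. The class and product names below are hypothetical placeholders, and the actual code in RecoTracker/LSTGeometry may differ; only the standard CMSSW types and records are real.

// Hypothetical sketch of an ESProducer computing the maps from the tracker geometry;
// the real LSTGeometryESProducer in RecoTracker/LSTGeometry may differ in names and inputs.
#include <memory>

#include "FWCore/Framework/interface/ESProducer.h"
#include "FWCore/Framework/interface/ModuleFactory.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/ESGetToken.h"
#include "Geometry/Records/interface/TrackerDigiGeometryRecord.h"
#include "Geometry/TrackerGeometryBuilder/interface/TrackerGeometry.h"
#include "RecoTracker/Record/interface/TrackerRecoGeometryRecord.h"

// Placeholder for the module-map product actually defined by this PR.
struct HypotheticalLSTGeometry {};

class HypotheticalLSTGeometryESProducer : public edm::ESProducer {
public:
  HypotheticalLSTGeometryESProducer(edm::ParameterSet const& iConfig) {
    auto cc = setWhatProduced(this);
    geomToken_ = cc.consumes();  // consume the tracker geometry from the EventSetup
  }

  std::unique_ptr<HypotheticalLSTGeometry> produce(TrackerRecoGeometryRecord const& iRecord) {
    TrackerGeometry const& trackerGeom = iRecord.get(geomToken_);
    // ... compute the module maps from trackerGeom, on the host ...
    (void)trackerGeom;
    return std::make_unique<HypotheticalLSTGeometry>();
  }

private:
  edm::ESGetToken<TrackerGeometry, TrackerDigiGeometryRecord> geomToken_;
};

DEFINE_FWK_EVENTSETUP_MODULE(HypotheticalLSTGeometryESProducer);

In this pattern the product is recomputed whenever the tracker geometry changes, which is what keeps the maps consistent with the geometry in use.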

This is the last major task in #46746.

c.c. @slava77

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50679/48907

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

A new Pull Request was created by @ariostas for master.

It involves the following packages:

  • HLTrigger/Configuration (hlt)
  • RecoTracker/IterativeTracking (reconstruction)
  • RecoTracker/LST (reconstruction)
  • RecoTracker/LSTCore (reconstruction)
  • RecoTracker/LSTGeometry (****)

The following packages do not have a category, yet:

RecoTracker/LSTGeometry
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@Martin-Grunewald, @Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @dgulhan, @elusian, @felicepantaleo, @gpetruc, @missirol, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Contributor

mmusich commented Apr 7, 2026

test parameters:

  • enable = hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt

@mmusich
Contributor

mmusich commented Apr 7, 2026

@cmsbuild, please test

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

-1

Failed Tests: UnitTests HLTP2Timing
Size: This PR adds an extra 104KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7657dc/52513/summary.html
COMMIT: e612f24
CMSSW: CMSSW_17_0_X_2026-04-07-1100/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50679/52513/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS

Comparison Summary

Summary:

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 17 workflow step(s) with memory usage exceeding the error threshold:

Expand to see workflows ...
  • Error: Workflow 34434.0_TTbar_14TeV+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.75_TTbar_14TeV+Run4D121_HLT75e33Timing step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7501_TTbar_14TeV+Run4D121_HLT75e33TrackingOnly step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7502_TTbar_14TeV+Run4D121_HLT75e33TrackingNtuple step2 max memory diff 191.9 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.751_TTbar_14TeV+Run4D121_HLT75e33TimingAlpaka step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.752_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5 step2 max memory diff 189.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7521_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5TrackLinkGNN step2 max memory diff 166.0 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.755_TTbar_14TeV+Run4D121_HLT75e33TimingLST step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.756_TTbar_14TeV+Run4D121_HLT75e33TimingTrimmedTracking step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.757_TTbar_14TeV+Run4D121_HLT75e33TimingMkFitFit step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.758_TTbar_14TeV+Run4D121_HLT75e33TimingTiclBarrel step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.759_TTbar_14TeV+Run4D121_HLTPhase2WithNano step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.775_TTbar_14TeV+Run4D121_NGTScoutingCAExtensionMergeT5 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.911_TTbar_14TeV+Run4D121_DD4hep step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34496.0_CloseByPGun_CE_E_Front_120um+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34500.0_CloseByPGun_CE_H_Coarse_Scint+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34634.999_TTbar_14TeV+Run4D121PU_PMXS1S2PR step3 max memory diff 191.8 exceeds +/- 90.0 MiB

@makortel
Contributor

makortel commented Apr 7, 2026

Is ~190 MB increase in memory usage expected?

(Outdated review thread on RecoTracker/LSTGeometry/test/dumpLSTGeometry.py)
@ariostas
Contributor Author

ariostas commented Apr 7, 2026

Is ~190 MB increase in memory usage expected?

That seems a bit high, but it's plausible. I'll double-check. Either way, it is only temporary: most of it is freed once the maps are constructed.

@makortel
Contributor

makortel commented Apr 7, 2026

Is ~190 MB increase in memory usage expected?

That seems a bit high, but it's plausible. I'll double-check. Either way, it is only temporary: most of it is freed once the maps are constructed.

According to the monitoring the peak memory usage would increase by ~190 MB, and thus freeing it afterwards doesn't help much if the job was killed because of going over the limit.

@makortel
Contributor

makortel commented Apr 7, 2026

test parameters:

  • workflows_profiling = 34434.0
  • enable_tests = profiling

@makortel
Contributor

makortel commented Apr 7, 2026

@cmsbuild, please test

Maybe one round of profiling tests would be worth it.

@mmusich
Contributor

mmusich commented Apr 21, 2026

@cmsbuild, please test with #50479

@ariostas
Contributor Author

please clarify if you were able to reproduce a crash with "An exception of category 'AsyncCallNotAllowed' occurred while ..." as in the fractionally available logs from the HLTP2Timing tests on a machine with a T4 (e.g. lxplus-gpu).

I did, back when I hadn't tightened the module maps, and the underlying reason was that it was running out of memory. I assume the same thing is still happening. I'm looking into it to see what else could be contributing to higher VRAM usage.

@slava77
Contributor

slava77 commented Apr 21, 2026

I did, back when I hadn't tightened the module maps, and the underlying reason was that it was running out of memory. I assume the same thing is still happening. I'm looking into it to see what else could be contributing to higher VRAM usage.

OK. I thought/misunderstood that it went away on that test machine with the latest updates.

@mmusich
Contributor

mmusich commented Apr 21, 2026

the fractionally available logs from the HLTP2Timing

now you have a link to the full logs:

21-Apr-2026 16:32:44 CEST  Initiating request to open file file:/data/user/cmsbuild//store/relval/CMSSW_15_1_0_pre3/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_150X_mcRun4_realistic_v1_STD_Run4D110_PU-v1/2590000/00c675dc-1517-4af7-8dd4-841e0668fefe.root
21-Apr-2026 16:32:58 CEST  Successfully opened file file:/data/user/cmsbuild//store/relval/CMSSW_15_1_0_pre3/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_150X_mcRun4_realistic_v1_STD_Run4D110_PU-v1/2590000/00c675dc-1517-4af7-8dd4-841e0668fefe.root
Begin processing the 1st record. Run 1, Event 7301, LumiSection 74 on stream 12 at 21-Apr-2026 16:33:19.818 CEST
Begin processing the 2nd record. Run 1, Event 7302, LumiSection 74 on stream 7 at 21-Apr-2026 16:33:19.819 CEST
Begin processing the 3rd record. Run 1, Event 7303, LumiSection 74 on stream 4 at 21-Apr-2026 16:33:19.819 CEST
Begin processing the 4th record. Run 1, Event 7304, LumiSection 74 on stream 8 at 21-Apr-2026 16:33:19.819 CEST
Begin processing the 5th record. Run 1, Event 7305, LumiSection 74 on stream 14 at 21-Apr-2026 16:33:19.820 CEST
Begin processing the 6th record. Run 1, Event 7306, LumiSection 74 on stream 13 at 21-Apr-2026 16:33:19.820 CEST
Begin processing the 7th record. Run 1, Event 7307, LumiSection 74 on stream 2 at 21-Apr-2026 16:33:19.820 CEST
Begin processing the 8th record. Run 1, Event 7308, LumiSection 74 on stream 9 at 21-Apr-2026 16:33:19.820 CEST
Begin processing the 9th record. Run 1, Event 7309, LumiSection 74 on stream 1 at 21-Apr-2026 16:33:19.821 CEST
Begin processing the 10th record. Run 1, Event 7310, LumiSection 74 on stream 15 at 21-Apr-2026 16:33:19.821 CEST
Begin processing the 11th record. Run 1, Event 7311, LumiSection 74 on stream 10 at 21-Apr-2026 16:33:19.821 CEST
Begin processing the 12th record. Run 1, Event 7312, LumiSection 74 on stream 11 at 21-Apr-2026 16:33:19.821 CEST
Begin processing the 13th record. Run 1, Event 7313, LumiSection 74 on stream 5 at 21-Apr-2026 16:33:19.822 CEST
Begin processing the 14th record. Run 1, Event 7314, LumiSection 74 on stream 0 at 21-Apr-2026 16:33:19.822 CEST
Begin processing the 15th record. Run 1, Event 7315, LumiSection 74 on stream 6 at 21-Apr-2026 16:33:19.822 CEST
Begin processing the 16th record. Run 1, Event 7316, LumiSection 74 on stream 3 at 21-Apr-2026 16:33:19.822 CEST
Begin processing the 17th record. Run 1, Event 7317, LumiSection 74 on stream 7 at 21-Apr-2026 16:33:20.236 CEST
Begin processing the 18th record. Run 1, Event 7318, LumiSection 74 on stream 15 at 21-Apr-2026 16:33:22.029 CEST
Begin processing the 19th record. Run 1, Event 7319, LumiSection 74 on stream 12 at 21-Apr-2026 16:33:24.905 CEST
----- Begin Fatal Exception 21-Apr-2026 16:33:29 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7318 stream: 15
   [1] Running path 'HLT_DoublePFPuppiJets128_DoublePFPuppiBTagDeepFlavour_2p4'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 21-Apr-2026 16:33:29 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7304 stream: 8
   [1] Running path 'HLT_PFPuppiHT1070'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 21-Apr-2026 16:33:29 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7311 stream: 10
   [1] Running path 'HLT_DoublePFPuppiJets128_DoublePFPuppiBTagDeepFlavour_2p4'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 21-Apr-2026 16:33:29 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7315 stream: 6
   [1] Running path 'HLT_PFPuppiHT1070'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------

@cmsbuild
Contributor

-1

Failed Tests: HLTP2Timing
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7657dc/52795/summary.html
COMMIT: 4e3de58
CMSSW: CMSSW_17_0_X_2026-04-20-2300/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50679/52795/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 3 lines to the logs
  • Reco comparison results: 7 differences found in the comparisons
  • DQMHistoTests: Total files compared: 68
  • DQMHistoTests: Total histograms compared: 4803106
  • DQMHistoTests: Total failures: 4340
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4798746
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 67 files compared)
  • Checked 282 log files, 243 edm output root files, 68 DQM output files
  • TriggerResults: no differences found

@slava77
Contributor

slava77 commented Apr 23, 2026

Failed Tests: HLTP2Timing

I was looking at this PR's timing log, compared to the log of a "reference" run with #50479.

@mmusich

  • do I understand correctly that the memory is polled at a 1 s interval and that it continues in a separate process from cmsRun across crashes? I see a sequence of
    • + now=1776782011 + elapsed=78 + max_mem=29250 at 16:33:31, aligned with the exception message at 16:33:29
    • + now=1776782149 + elapsed=216 + max_mem=29380 at 16:35:49, with a fatal exception at 16:35:48
    • + now=1776782316 + elapsed=383 + max_mem=29662 at 16:38:36, with a fatal exception at 16:38:36
    • Notably, in this same Phase2_L1P2GT_HLT the reference has a max of 29784 MiB, above the max at the time of the crashes
  • Is the memory profile vs time available somewhere? I wanted to see when the reference reaches the max.
  • BTW, wasn't the timing supposed to be restricted to one T4? (The max memory is close to 30 GB, twice the size of a T4.)

@mmusich
Contributor

mmusich commented Apr 23, 2026

@slava77

do I understand correctly that the memory is polled at a 1 s interval and that it continues in a separate process from cmsRun across crashes?

yes.

Is the mem profile vs time available somewhere? I wanted to see when the reference reaches the max

not yet, see #50479 (comment).

BTW, wasn't the timing supposed to be restricted to one T4? (the max mem is close to 30 GB, twice the size of a T4.)

no, we're using both GPUs. Only one CPU socket (out of two) is used in order to have a 50/50 compute split.

@ariostas
Contributor Author

I've been doing some debugging, and I'm puzzled by what I've been finding.

I found that to reliably and clearly reproduce the issue, it's better to restrict to a single GPU, use 1 job, and 16 threads/streams. I'm using the runHLTTiming script, but only running run_phase2_gpu (and setting -j 1 -t 16 -s 16).

I made a new branch that adds this extra commit SegmentLinking@a9ab182. The commit just switches back to loading the maps from the binary files instead of using the ES product, while leaving all the setup in place. With this setup, the VRAM usage still increases a lot, even though the ESProducer is CPU-only and the product is not used at all.

However, by simply commenting out this line the issue is resolved. Here is a plot comparing VRAM usage with and without that line.

[image: plot comparing VRAM usage with and without that line]

So it seems that just having the ESProducer run causes VRAM usage to increase, even though it is constructed purely on the host and the product is not being used. I find this very confusing, so I was wondering if you have any suggestions.

I should mention that if I dial it back to 1 thread/stream, then everything looks identical in both cases. Also, I have tried to profile it with nsys, but it gets stuck when I try to use more than 1 stream.

@makortel
Contributor

So it seems that just having the ESProducer run causes VRAM usage to increase, even though it is purely constructed on the host, and the product is not being used. I find this very confusing, so I was wondering if you have any suggestions.

The situation almost smells like (or, that would be the easiest explanation I could quickly think of) some other component consuming an ES data product on the device, which would trigger the production, but in a way that the component does not fail if the data product is missing. Is e.g. EventSetupRecordDataGetter used in any way (from a quick git grep I'd guess "no", but maybe better to ask / check explicitly)?

Does the behavior of excessive memory usage reproduce on 1 thread/stream? Does the behavior reproduce if processing only a few (down to 1) events?

If the answers are "yes", I'd suggest to add the Tracer service

process.add_(cms.Service("Tracer", dumpPathsAndConsumes=cms.untracked.bool(True)))

and put the (large) log somewhere accessible. This service prints every framework transition for every module, and when configured like this also the ED and ES data product consumption information.

@ariostas
Contributor Author

Does the behavior of excessive memory usage reproduce on 1 thread/stream?

No, for 1 thread/stream everything looks normal.

Does the behavior reproduce if processing only few (down to 1) events?

Yeah, it still happens with only a few events.

I'd suggest to add the Tracer service

Here's the log with the tracer: part1, part2.

Nothing seems obviously wrong. LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' is marked as consuming LSTGeometryESProducer/'hltLSTGeometry', but as I mentioned, it is not actually used because it is commented out. I don't see any obvious duplication of products or anything like that.

@Dr15Jones
Contributor

Nothing seems obviously wrong. LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' is marked as consuming LSTGeometryESProducer/'hltLSTGeometry', but as I mentioned, it is not actually used because it is commented out. I don't see any obvious duplication of products or anything like that.

commenting out the request in produce is not enough. Saying you consume the item will cause the framework to prefetch it. So to actually keep the module from being called requires that no module says it consumes it.
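
As a hedged illustration (hypothetical names, not the actual LSTModulesDevESProducer code), the token registration in the constructor is what makes the framework prefetch, and therefore run, the upstream producer, independently of whether produce() ever reads the product:

// Hypothetical sketch: declaring the consumption is enough to schedule the
// producer of the consumed product; commenting out only the get() in produce()
// does not prevent that.
#include <memory>

#include "FWCore/Framework/interface/ESProducer.h"
#include "FWCore/Framework/interface/ModuleFactory.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/ESGetToken.h"
#include "RecoTracker/Record/interface/TrackerRecoGeometryRecord.h"

struct HypotheticalLSTGeometry {};     // stand-in for the hltLSTGeometry product
struct HypotheticalModulesProduct {};  // stand-in for the consumer's own product

class HypotheticalConsumerESProducer : public edm::ESProducer {
public:
  HypotheticalConsumerESProducer(edm::ParameterSet const&) {
    auto cc = setWhatProduced(this);
    geometryToken_ = cc.consumes();  // this declaration alone triggers the prefetch
  }

  std::unique_ptr<HypotheticalModulesProduct> produce(TrackerRecoGeometryRecord const& iRecord) {
    // auto const& geom = iRecord.get(geometryToken_);  // commented out: the product is never
    //                                                  // read here, yet it is still produced
    return std::make_unique<HypotheticalModulesProduct>();
  }

private:
  edm::ESGetToken<HypotheticalLSTGeometry, TrackerRecoGeometryRecord> geometryToken_;
};

DEFINE_FWK_EVENTSETUP_MODULE(HypotheticalConsumerESProducer);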

@ariostas
Contributor Author

ariostas commented Apr 29, 2026

commenting out the request in produce is not enough. Saying you consume the item will cause the framework to prefetch it. So to actually keep the module from being called requires that no module says it consumes it.

Well, if I just comment out the consume, it's back to normal. The point is that somehow the module being called causes VRAM usage to increase, even though it's a CPU module and the product is never used, so it should have no effect on VRAM usage.

@makortel
Contributor

Nothing seems obviously wrong. LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' is marked as consuming LSTGeometryESProducer/'hltLSTGeometry', but as I mentioned, it is not actually used because it is commented out. I don't see any obvious duplication of products or anything like that.

commenting out the request in produce is not enough. Saying you consume the item will cause the framework to prefetch it. So to actually keep the module from being called requires that no module says it consumes it.

Right. This behavior is visible in the Tracer log as well:

++++++++++++ starting: processing esmodule: label = 'hltLSTGeometry' type = LSTGeometryESProducer in record = TrackerRecoGeometryRecord
<cut>
++++++++++++ finished: processing esmodule: label = 'hltLSTGeometry' type = LSTGeometryESProducer in record = TrackerRecoGeometryRecord
++++++++++ finished: prefetching for esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord
++++++++++ starting: processing esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord
++++++++++ finished: processing esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord

So when you

simply commenting out this line

the hltLSTGeometry can't be run, and this does not result in an error because the only consumer does not actually access the data, given that these lines are commented out:
https://github.com/SegmentLinking/cmssw/blob/a9ab18292aa3f5a4b0774aecec84d628f17a544a/RecoTracker/LST/plugins/alpaka/LSTModulesDevESProducer.cc#L40-L42

This analysis does not answer the question of how LSTGeometryESProducer leads to GPU memory being used.

@makortel
Contributor

makortel commented Apr 29, 2026

This analysis does not answer the question of how LSTGeometryESProducer leads to GPU memory being used.

The Tracer log shows only LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' consuming the data product of hltLSTGeometry (and from the code only the host data product is consumed). The log also shows that only one produce call is made on hltLSTGeometry (i.e. no sign of implicit host-to-device copy; well, there can't be because LSTGeometryESProducer is not an Alpaka module).

@makortel
Contributor

Does the behavior of excessive memory usage reproduce on 1 thread/stream?

No, for 1 thread/stream everything looks normal.

If 1 thread/stream shows "good behavior", I'm wondering if the caching allocator could play a role. The allocator is shared, and if some modules concurrently allocate large temporary buffers, those buffers might end up being held by the caching allocator without being used later in the job. On 1 thread these temporary buffers would be allocated and deallocated serially, and the same large buffer could be reused by multiple modules.

But this is, of course, pure speculation, and does not explain the role of the existence of hltLSTGeometry in the GPU memory usage.
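
To make the serial-vs-concurrent intuition concrete, here is a toy sketch; this is not the actual cms::alpakatools::CachingAllocator, and the buffer size and stream count are purely illustrative:

// Toy model: a shared cache that keeps released buffers for reuse. One stream
// reuses a single cached buffer serially; N concurrent streams each hold their
// own temporary buffer, so the peak device allocation scales with N.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct ToyCache {
  std::size_t live = 0, peak = 0;
  std::vector<std::size_t> freeList;  // cached (released but not freed) buffers
  std::size_t allocate(std::size_t bytes) {
    if (!freeList.empty()) {  // reuse a cached buffer (all buffers same size here)
      freeList.pop_back();
      return bytes;
    }
    live += bytes;  // would be a real device allocation
    peak = std::max(peak, live);
    return bytes;
  }
  void release(std::size_t bytes) { freeList.push_back(bytes); }  // cached, not freed
};

int main() {
  constexpr std::size_t bufBytes = 100ULL << 20;  // ~100 MB temporary buffer (illustrative)

  ToyCache serial;  // 1 stream: allocate and release one event at a time
  for (int ev = 0; ev < 16; ++ev) serial.release(serial.allocate(bufBytes));

  ToyCache concurrent;  // 16 streams: all temporaries live at the same time
  for (int s = 0; s < 16; ++s) concurrent.allocate(bufBytes);

  std::cout << "serial peak:     " << (serial.peak >> 20) << " MB\n"
            << "concurrent peak: " << (concurrent.peak >> 20) << " MB\n";
}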

@makortel
Contributor

The CachingAllocator hypothesis could be investigated further by comparing the behavior between 1-thread and many-thread cases (on a few events).

The debug prints of the CachingAllocator can be enabled with

if not hasattr(process, "AlpakaServiceCudaAsync"):
    process.load("HeterogeneousCore.AlpakaServices.AlpakaServiceCudaAsync_cfi")
    process.AlpakaServiceCudaAsync.verbose = True

A crude way to see the functions that lead to actual memory allocations would be

cmsTraceFunction "cms::alpakatools::CachingAllocator<alpaka::DevCudaRt, alpaka::QueueCudaRtNonBlocking>::allocateBuffer" cmsRun ...

(I'm not 100 % sure I got the CachingAllocator template instantiation right; possibly tracing calls to just cudaMalloc might also do the trick)

@ariostas
Contributor Author

The CachingAllocator hypothesis could be investigated further...

I'm currently recompiling everything after adding

<flags CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"/>

to all the LST build files. I'll see what happens and try using the debug prints. Thanks!

@ariostas
Contributor Author

Okay, so disabling the caching allocator shows that there's a big spike. So it's not actually the caching allocator itself; rather, something is briefly allocating a big chunk of memory.

[image: VRAM usage plot showing the short-lived spike with the caching allocator disabled]

I'll try tracing CachingAllocator/cudaMalloc calls to see if I can pinpoint what's happening.

@ariostas
Contributor Author

ariostas commented May 1, 2026

I couldn't get cmsTraceFunction to work. Not sure why, but it didn't trace any function that I tried.

Using gdb directly, I looked at calls to cudaMalloc, and I see that there is no single giant allocation; rather, there are many more allocations than before (more than double).

[image: comparison of cudaMalloc allocation counts]

I'm going to be on vacation for the next 2 weeks. But I'll keep looking into this when I get back.

@slava77
Contributor

slava77 commented May 1, 2026

Can it be something to do with the number of queues (and, subsequently, some extra allocations coming per queue)?
How is the number of queues defined? Can it vary randomly (presumably repeatable, but varying from unrelated changes)?

@mmusich
Contributor

mmusich commented May 7, 2026

In addition to the problems already discussed, this branch now has conflicts that must be resolved. @ariostas
