Running a recent HLT menu in CMSSW_15_0_0_pre2, I see the runtime error in [1].
Some facts and circumstantial evidence.
The issue is not fully reproducible (so, right now I don't really have a reproducer).
The issue happens pretty frequently in the HLT workflow I'm testing. The workflow consists of running 8 jobs in parallel, each with 32 threads and 24 concurrent events, on a machine with the same hardware as a "2022 HLT node", e.g. hilton-c2b02-44-01 (2 AMD Milan CPUs + 2 NVIDIA GPUs). Fwiw, a readme + example of what I'm running is in [2] and [3] (the recipe assumes the use of one of the HLT/GPU nodes in the CMS online network; the instructions could be adapted to lxplus if needed).
I have seen the issue with and without offloading to GPUs ("without" meaning options.accelerators = ["cpu"]).
I have seen the issue starting with CMSSW_15_0_0_pre2, and I see it also in more recent 15_0_X IBs.
I ran the same workflow more than once in CMSSW_15_0_0_pre1, and I have not seen this runtime error in that pre-release so far.
So far, I failed to reproduce the problem with simpler configurations (as opposed to a full-blown HLT menu running on multiple jobs using all threads).
Talking to @fwyzard and @makortel, the issue looks compatible with a race condition.
One suggestion by @makortel was to check if the error occurs at the very beginning of the job or later. I managed to reproduce the error with more MessageLogger output enabled, and it happened ~60 events into the job (using 32 threads and 24 streams in the job), so early on in the job but not at the very beginning.
FYI: @cms-sw/hlt-l2
Edit-1 (Feb-09): a script used to reproduce the error on lxplus was added in #47287 (comment).
[1]
----- Begin Fatal Exception 26-Jan-2025 10:02:31 CET-----------------------
An exception of category 'FatalRootError' occurred while
[0] Processing Event run: 386593 lumi: 94 event: 213402124 stream: 6
[1] Running path '@finalPath'
[2] Calling method for module GlobalEvFOutputModule/'hltOutputParkingSingleMuon8'
Additional Info:
[a] Fatal Root Error: @SUB=TClass::StreamerDefault
fStreamerImpl not properly initialized (0)
----- End Fatal Exception -------------------------------------------------
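Since the workflow runs several jobs in parallel and the error is intermittent, it can help to collect the event, stream, and module from every exception block across the job logs. The following is only a minimal sketch, not part of the recipe in [2] and [3]: the function name `parse_fatal_exceptions` and the regular expressions are hypothetical, written against the layout of the block in [1], and may need adjusting for other message formats.

```python
import re

# Patterns follow the layout of the exception block quoted in [1].
# They are assumptions based on that single example, not a framework API.
BEGIN_RE = re.compile(r"^-+ Begin Fatal Exception (?P<timestamp>.+?)-+$")
END_RE = re.compile(r"^-+ End Fatal Exception -+$")
CATEGORY_RE = re.compile(r"An exception of category '(?P<category>[^']+)'")
EVENT_RE = re.compile(
    r"Processing\s+Event run: (?P<run>\d+) lumi: (?P<lumi>\d+) "
    r"event: (?P<event>\d+) stream: (?P<stream>\d+)"
)
MODULE_RE = re.compile(
    r"Calling method for module (?P<module_type>[^/\s]+)/'(?P<module_label>[^']+)'"
)


def parse_fatal_exceptions(lines):
    """Return one dict per 'Fatal Exception' block found in the log lines."""
    records = []
    current = None  # dict being filled while inside a block, else None
    for raw in lines:
        line = raw.strip()
        begin = BEGIN_RE.match(line)
        if begin:
            current = {"timestamp": begin.group("timestamp").strip()}
            continue
        if current is None:
            continue  # ignore everything outside an exception block
        if END_RE.match(line):
            records.append(current)
            current = None
            continue
        for pattern in (CATEGORY_RE, EVENT_RE, MODULE_RE):
            found = pattern.search(line)
            if found:
                current.update(found.groupdict())
    return records


# The block from [1], used here as sample input.
EXAMPLE_LOG = """\
----- Begin Fatal Exception 26-Jan-2025 10:02:31 CET-----------------------
An exception of category 'FatalRootError' occurred while
[0] Processing Event run: 386593 lumi: 94 event: 213402124 stream: 6
[1] Running path '@finalPath'
[2] Calling method for module GlobalEvFOutputModule/'hltOutputParkingSingleMuon8'
Additional Info:
[a] Fatal Root Error: @SUB=TClass::StreamerDefault
fStreamerImpl not properly initialized (0)
----- End Fatal Exception -------------------------------------------------
"""

if __name__ == "__main__":
    for record in parse_fatal_exceptions(EXAMPLE_LOG.splitlines()):
        print(record)
```

All captured values are kept as strings. Running the function over the log of each parallel job would show whether the failures cluster on particular output modules or streams.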
I also locally reverted, on top of 15_0_0_pre2, two PRs that seemed to me loosely related to output modules, and I still see the same runtime error as [1].
[2]
[3]