add GPU memory and usage to the phase2 HLT timing check#50479

Merged
cmsbuild merged 2 commits into cms-sw:master from mmusich:mm_dev_hlt_timing_gpu_memory
Apr 23, 2026

@mmusich
Contributor

@mmusich mmusich commented Mar 20, 2026

PR description:

Title says it all.

PR validation:

To be tested by the bot.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

N/A

@mmusich
Contributor Author

mmusich commented Mar 20, 2026

test parameters:

  • enable = hlt_p2_timing

@cmsbuild
Contributor

cmsbuild commented Mar 20, 2026

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50479/48627

@mmusich
Contributor Author

mmusich commented Mar 20, 2026

@cmsbuild, please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 24KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ab0e6b/52111/summary.html
COMMIT: 8db7080
CMSSW: CMSSW_16_1_X_2026-03-19-2300/el8_amd64_gcc13
Additional Tests: HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50479/52111/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Comparison Summary

Summary:

  • You potentially added 3 lines to the logs
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4185284
  • DQMHistoTests: Total failures: 114
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4185150
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 198 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Contributor

cmsbuild commented Apr 4, 2026

Milestone for this pull request has been moved to CMSSW_17_0_X. Please open a backport if it should also go in to CMSSW_16_1_X.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50479/49068

@cmsbuild
Contributor

Pull request #50479 was updated.

@cmsbuild
Contributor

+1

Size: This PR adds an extra 24KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ab0e6b/52764/summary.html
COMMIT: 72d8f5c
CMSSW: CMSSW_17_0_X_2026-04-19-2300/el8_amd64_gcc13
Additional Tests: HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50479/52764/install.sh to create a dev area with all the needed externals and cmssw changes.

HLT P2 Timing: chart

Comparison Summary

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4186813
  • DQMHistoTests: Total failures: 32
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4186761
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 197 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

@mmusich mmusich changed the title DRAFT: add GPU memory and usage to the phase2 HLT timing check add GPU memory and usage to the phase2 HLT timing check Apr 20, 2026
@mmusich mmusich marked this pull request as ready for review April 20, 2026 13:52
@cmsbuild
Contributor

A new Pull Request was created by @mmusich for master.

It involves the following packages:

  • HLTrigger/Configuration (hlt)

@Martin-Grunewald, @mmusich can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @SohamBhattacharya, @VourMa, @missirol, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented Apr 21, 2026

+hlt

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @mandrenguyen, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)
Notice This PR was tested with additional Pull Request(s), please also merge them if necessary: cms-sw/cms-bot#2725

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit 4ce1e8d into cms-sw:master Apr 23, 2026
12 checks passed
@mmusich mmusich deleted the mm_dev_hlt_timing_gpu_memory branch April 23, 2026 06:15
@slava77
Contributor

slava77 commented Apr 23, 2026

Where do the details of the memory usage go? Is it just the timing log, like this?

Benchmarking NGTScouting_L1P2GT_HLT.py
...
----- GPU SUMMARY -----
Peak memory: 26994 MiB
Mean memory: 19726 MiB

@mmusich
Contributor Author

mmusich commented Apr 23, 2026

where do the details of the memory usage go?

They go into CSV files, which are not propagated to the bot output yet; for the moment you only get the summary in the log.
The next step is to update the bot code to fetch them and create nice charts.
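As an illustration of how a log summary like the one quoted above could be derived from such a CSV file, here is a minimal sketch. The CSV layout (a header line followed by `timestamp,memory_used_mib` rows), the function name `gpu_summary`, and the sample values are assumptions made for this example; the actual format written by the timing script is not shown in this thread.

```shell
#!/usr/bin/env bash
# Summarise GPU memory samples from a CSV file.
# Assumed (hypothetical) layout: one header line, then "timestamp,memory_used_mib".
gpu_summary() {
  awk -F',' '
    NR > 1 {                       # skip the header row
      used = $2 + 0                # force numeric interpretation
      if (used > peak) peak = used
      sum += used
      n++
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      print "----- GPU SUMMARY -----"
      printf "Peak memory: %d MiB\n", peak
      printf "Mean memory: %d MiB\n", sum / n
    }
  ' "$1"
}

# Made-up sample data, only to exercise the function.
sample_csv=$(mktemp)
printf '%s\n' 'timestamp,memory_used_mib' '0,100' '1,300' '2,200' > "$sample_csv"
summary=$(gpu_summary "$sample_csv")
echo "$summary"
rm -f "$sample_csv"
```

With the three made-up samples the function reports a peak of 300 MiB and a mean of 200 MiB.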

@mmusich
Contributor Author

mmusich commented Apr 24, 2026

Hi @smuzaffar I am afraid this PR broke the Phase2 HLT timing test in the 17.0.X IBs, see e.g.:

On the other hand I am a little puzzled, because the log files look fine to me:

Also, if I am not missing anything, the Jenkins log claims SUCCESS:

Do you have any hints to what might be wrong?

@smuzaffar
Contributor

@mmusich, I see the following message in the log file:

./runHLTTiming.sh: line 151: 2069485 Terminated              tail -f "$TMP_LOG_FILE"

The same message is there for the PR tests too:

/cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50479/52764/CMSSW_17_0_X_2026-04-19-2300/src/HLTrigger/Configuration/python/HLT_75e33/test/runHLTTiming.sh: line 151: 3811135 Terminated              tail -f "$TMP_LOG_FILE"

The way we run the test for PRs is

timeout $TIMEOUT bash -e ${HLT_BASEDIR}/${HLT_P2_SCRIPT}/runHLTTiming.sh 2>&1 | tee -a ${WORKSPACE}/hlt-p2-timing.log

and, if I am not wrong, the tee command basically eats up the exit code: the pipeline returns the status of tee, never a non-zero code from the script, which is why the Jenkins job did not fail.
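The exit-code behaviour described here can be reproduced in isolation. The sketch below assumes bash and uses a hypothetical function `failing_step` as a stand-in for the timing script; it is not the cms-bot code. It shows the default pipeline status, then two standard bash ways of recovering the status of the first command: the `PIPESTATUS` array and `set -o pipefail`.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the timing script: prints a line and fails with 3.
failing_step() {
  echo "running step"
  return 3
}

# Default behaviour: the pipeline status is that of the last command (tee).
failing_step 2>&1 | tee -a /dev/null
status_default=$?

# Option 1: read the status of the first pipeline element from PIPESTATUS.
failing_step 2>&1 | tee -a /dev/null
status_first=${PIPESTATUS[0]}

# Option 2: with pipefail, the pipeline reports the last non-zero status.
set -o pipefail
status_pipefail=0
failing_step 2>&1 | tee -a /dev/null || status_pipefail=$?
set +o pipefail

echo "default=$status_default first=$status_first pipefail=$status_pipefail"
# prints: default=0 first=3 pipefail=3
```

Either option would let a wrapper propagate a failure of the piped script; which one fits the bot's wrapper is a choice for its maintainers.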

@mmusich
Contributor Author

mmusich commented Apr 24, 2026

If I am not wrong, the tee command basically eats up the exit code and never returns a non-zero code, which is why the Jenkins job did not fail.

@smuzaffar, but then how are tests failing e.g. here: #50679 (comment)?
That test was run with the changes of this PR.

@smuzaffar
Contributor

smuzaffar commented Apr 24, 2026

There they are failing because, after running runHLTTiming.sh, https://github.com/cms-sw/cms-bot/blob/master/pr_testing/run-pr-hlt-p2-timing.sh assumes that a few JSON files should have been generated; the cms-bot script was not able to find them and marked the job as failed.

In the recent failure I see:

----- Begin Fatal Exception 21-Apr-2026 16:40:55 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7315 stream: 10
   [1] Running path 'HLT_AK4PFPuppiJet520'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 21-Apr-2026 16:40:55 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 74 event: 7311 stream: 14
   [1] Running path 'HLT_PFPuppiMETTypeOne140_PFPuppiMHT140'
   [2] Calling method for module LSTProducer@alpaka/'hltLST'
Exception Message:
A std::exception was thrown.
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02938/el8_amd64_gcc13/external/alpaka/2.1.1-3caaac8d71f39d400ab2511b2403675a/include/alpaka/mem/buf/uniformCudaHip/traits/BufUniformCudaHipRtTraits.hpp(212) 'TApi::malloc(&memPtr, static_cast<std::size_t>(getWidth(extent)) * sizeof(TElem))' returned error  : 'cudaErrorMemoryAllocation': 'out of memory'!
----- End Fatal Exception -------------------------------------------------

I guess that, due to these failures, the output JSON files were not created, whereas when the PR tests ran for this PR the above error did not happen and the test might have created the needed JSON files.
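The kind of post-run check described above (mark the job as failed when expected output files are absent) can be sketched as follows. This is only an illustration under stated assumptions: the function name `check_outputs` and the file names `resources.json` and `summary.json` are hypothetical and are not taken from run-pr-hlt-p2-timing.sh.

```shell
#!/usr/bin/env bash
# Return non-zero if any expected output file is missing or empty.
# Arguments: a directory, then the expected file names (hypothetical here).
check_outputs() {
  local dir=$1; shift
  local name missing=0
  for name in "$@"; do
    if [ ! -s "$dir/$name" ]; then
      echo "missing output: $dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Exercise the check in a scratch directory containing one of two files.
workdir=$(mktemp -d)
echo '{}' > "$workdir/resources.json"

check_ok=0
check_outputs "$workdir" resources.json || check_ok=$?

check_missing=0
check_outputs "$workdir" resources.json summary.json || check_missing=$?

echo "ok=$check_ok missing=$check_missing"
rm -rf "$workdir"
```

A check of this kind catches a run that crashed before writing its outputs even when the exit code of the script itself was lost in a pipeline.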


5 participants