
torch: reduce parallel build process#10174

Merged: smuzaffar merged 2 commits into IB/CMSSW_16_0_X/master from torch-max-jobs on Nov 5, 2025

Conversation

@smuzaffar
Contributor

On aarch64 nodes (20 cores, 58 GB RAM), torch failed to build: we build with -j 20, and 20 concurrent nvcc processes consume so much memory that the system starts killing them. This change configures the torch build to use at most MemoryGB/4 parallel jobs.
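
As a rough illustration of the MemoryGB/4 rule, here is a minimal sketch in Python. It is not the code from this PR: the function name `max_build_jobs`, the `gb_per_job` parameter, and the extra cap at the core count are assumptions made for the example. The `MAX_JOBS` name in the output is the environment variable that PyTorch's build reads to limit parallelism; that this PR sets it is only a guess based on the branch name.

```python
def max_build_jobs(memory_gb: int, cores: int, gb_per_job: int = 4) -> int:
    """Return a parallel job count limited by memory (hypothetical helper).

    Allows one job per `gb_per_job` GB of RAM, never more jobs than cores
    (an assumption, not stated in the PR), and never fewer than one job.
    """
    by_memory = memory_gb // gb_per_job
    return max(1, min(cores, by_memory))


if __name__ == "__main__":
    # The aarch64 node described above: 20 cores, 58 GB RAM.
    jobs = max_build_jobs(memory_gb=58, cores=20)
    print(f"MAX_JOBS={jobs}")  # prints "MAX_JOBS=14"
```

With the node described above, this gives 14 jobs instead of 20, which leaves each nvcc process roughly 4 GB of headroom.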

@smuzaffar
Contributor Author

please test for el8_aarch64_gcc13

@cmsbuild
Contributor

cmsbuild commented Nov 4, 2025

A new Pull Request was created by @smuzaffar for branch IB/CMSSW_16_0_X/master.

@akritkbehera, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Nov 4, 2025

cms-bot internal usage

@smuzaffar
Contributor Author

please test for el8_aarch64_gcc13

@cmsbuild
Contributor

cmsbuild commented Nov 4, 2025

Pull request #10174 was updated.

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Nov 5, 2025

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-52e027/49258/summary.html
COMMIT: f8037ac
CMSSW: CMSSW_16_0_X_2025-11-04-1100/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10174/49258/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-52e027/49258/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-52e027/49258/git-merge-result

Comparison Summary

Summary:

@smuzaffar
Contributor Author

+externals

Py3-torch successfully built for aarch64

@smuzaffar smuzaffar merged commit d8d6228 into IB/CMSSW_16_0_X/master Nov 5, 2025
9 of 10 checks passed
@cmsbuild
Contributor

cmsbuild commented Nov 5, 2025

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_16_0_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @ftenchini, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild
Contributor

cmsbuild commented Nov 5, 2025

-1

Failed Tests: UnitTests RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-52e027/49259/summary.html
COMMIT: f8037ac
CMSSW: CMSSW_16_0_X_2025-11-02-2300/el8_aarch64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10174/49259/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-52e027/49259/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-52e027/49259/git-merge-result

Failed Unit Tests

I found 5 errors in the following unit tests:

---> test DiMuonVall had ERRORS
---> test DMRall had ERRORS
---> test testCalibTrackerSiStripCommonAll had ERRORS
and more ...

Failed RelVals

----- Begin Fatal Exception 05-Nov-2025 09:12:01 CET-----------------------
An exception of category 'FallbackFileOpenError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'prevalidation_step'
   [2] Calling method for module MixingModule/'mix'
   [3] Calling RootInputFileSequence::initTheFile()
   [4] Calling StorageFactory::open()
   [5] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd-cms.infn.it//store/relval/CMSSW_10_6_0/RelValMinBias_13/GEN-SIM/106X_mcRun2_asymptotic_v3-v1/10000/45C5550A-82EF-D54D-B536-2D12D9CC673D.root'
   Additional Info:
      [a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://cms-xrd-global.cern.ch//eos/cms/store/relval/CMSSW_10_6_0/RelValMinBias_13/GEN-SIM/106X_mcRun2_asymptotic_v3-v1/10000/45C5550A-82EF-D54D-B536-2D12D9CC673D.root
      [b] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_0/RelValMinBias_13/GEN-SIM/106X_mcRun2_asymptotic_v3-v1/10000/45C5550A-82EF-D54D-B536-2D12D9CC673D.root
      [c] Input file root://xrootd-cms.infn.it//store/relval/CMSSW_10_6_0/RelValMinBias_13/GEN-SIM/106X_mcRun2_asymptotic_v3-v1/10000/45C5550A-82EF-D54D-B536-2D12D9CC673D.root could not be opened.
      [d] XrdCl::File::Open(name='root://xrootd-cms.infn.it//store/relval/CMSSW_10_6_0/RelValMinBias_13/GEN-SIM/106X_mcRun2_asymptotic_v3-v1/10000/45C5550A-82EF-D54D-B536-2D12D9CC673D.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
' (errno=3011, code=400). No additional data servers were found.
      [e] Last URL tried: root://cms-xrd-global.cern.ch:1094//store/relval/CMSSW_10_6_0/RelValMinBias_13/GEN-SIM/106X_mcRun2_asymptotic_v3-v1/10000/45C5550A-82EF-D54D-B536-2D12D9CC673D.root?tried=+1213llrxrd-redir.in2p3.fr&xrdcl.requuid=d7ca0e0a-c7d4-490f-a95b-14b6fb828ff2
      [f] Problematic data server: cms-xrd-global.cern.ch:1094
      [g] Disabled source: cms-xrd-global.cern.ch:1094
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 05-Nov-2025 09:09:09 CET-----------------------
An exception of category 'FallbackFileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=MixingModule label='mix'
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd-cms.infn.it//store/relval/CMSSW_13_0_11/RelValMinBias_14TeV/GEN-SIM-RECO/130X_mcRun3_2023_realistic_withEarly2023BS_v1_FastSim-v1/2580000/d4976755-655b-4715-bc0f-745eb585befe.root'
   Additional Info:
      [a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://cms-xrd-global.cern.ch//eos/cms/store/relval/CMSSW_13_0_11/RelValMinBias_14TeV/GEN-SIM-RECO/130X_mcRun3_2023_realistic_withEarly2023BS_v1_FastSim-v1/2580000/d4976755-655b-4715-bc0f-745eb585befe.root
      [b] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_13_0_11/RelValMinBias_14TeV/GEN-SIM-RECO/130X_mcRun3_2023_realistic_withEarly2023BS_v1_FastSim-v1/2580000/d4976755-655b-4715-bc0f-745eb585befe.root
      [c] Input file root://xrootd-cms.infn.it//store/relval/CMSSW_13_0_11/RelValMinBias_14TeV/GEN-SIM-RECO/130X_mcRun3_2023_realistic_withEarly2023BS_v1_FastSim-v1/2580000/d4976755-655b-4715-bc0f-745eb585befe.root could not be opened.
      [d] XrdCl::File::Open(name='root://xrootd-cms.infn.it//store/relval/CMSSW_13_0_11/RelValMinBias_14TeV/GEN-SIM-RECO/130X_mcRun3_2023_realistic_withEarly2023BS_v1_FastSim-v1/2580000/d4976755-655b-4715-bc0f-745eb585befe.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
' (errno=3011, code=400). No additional data servers were found.
      [e] Last URL tried: root://cms-xrd-global.cern.ch:1094//store/relval/CMSSW_13_0_11/RelValMinBias_14TeV/GEN-SIM-RECO/130X_mcRun3_2023_realistic_withEarly2023BS_v1_FastSim-v1/2580000/d4976755-655b-4715-bc0f-745eb585befe.root?tried=+1213xrootd-redic.pi.infn.it&xrdcl.requuid=f11b28dd-9afc-4fa8-82d4-980bb6d887c2
      [f] Problematic data server: cms-xrd-global.cern.ch:1094
      [g] Disabled source: cms-xrd-global.cern.ch:1094
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 05-Nov-2025 09:05:34 CET-----------------------
An exception of category 'FallbackFileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=MixingModule label='mix'
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd-cms.infn.it//store/relval/CMSSW_12_0_0_pre4/RelValMinBias_13/GEN-SIM/113X_mc2017_realistic_v5-v1/00000/a21693e9-4d25-496a-96e2-c28232a7a712.root'
   Additional Info:
      [a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://cms-xrd-global.cern.ch//eos/cms/store/relval/CMSSW_12_0_0_pre4/RelValMinBias_13/GEN-SIM/113X_mc2017_realistic_v5-v1/00000/a21693e9-4d25-496a-96e2-c28232a7a712.root
      [b] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_12_0_0_pre4/RelValMinBias_13/GEN-SIM/113X_mc2017_realistic_v5-v1/00000/a21693e9-4d25-496a-96e2-c28232a7a712.root
      [c] Input file root://xrootd-cms.infn.it//store/relval/CMSSW_12_0_0_pre4/RelValMinBias_13/GEN-SIM/113X_mc2017_realistic_v5-v1/00000/a21693e9-4d25-496a-96e2-c28232a7a712.root could not be opened.
      [d] XrdCl::File::Open(name='root://xrootd-cms.infn.it//store/relval/CMSSW_12_0_0_pre4/RelValMinBias_13/GEN-SIM/113X_mc2017_realistic_v5-v1/00000/a21693e9-4d25-496a-96e2-c28232a7a712.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
' (errno=3011, code=400). No additional data servers were found.
      [e] Last URL tried: root://cms-xrd-global.cern.ch:1094//store/relval/CMSSW_12_0_0_pre4/RelValMinBias_13/GEN-SIM/113X_mc2017_realistic_v5-v1/00000/a21693e9-4d25-496a-96e2-c28232a7a712.root?tried=+1213llrxrd-redir.in2p3.fr&xrdcl.requuid=66a56f56-82e4-4518-a6d0-512e58bb1068
      [f] Problematic data server: cms-xrd-global.cern.ch:1094
      [g] Disabled source: cms-xrd-global.cern.ch:1094
----- End Fatal Exception -------------------------------------------------
