Skip to content

Fix: Add missing buffer flush for PCL to MillePedeAlignmentAlgorithm#50447

Merged
cmsbuild merged 1 commit intocms-sw:CMSSW_16_0_Xfrom
goblirsc:MG_fixGBLflush_16_0_X
Mar 18, 2026
Merged

Fix: Add missing buffer flush for PCL to MillePedeAlignmentAlgorithm#50447
cmsbuild merged 1 commit intocms-sw:CMSSW_16_0_Xfrom
goblirsc:MG_fixGBLflush_16_0_X

Conversation

@goblirsc
Copy link
Copy Markdown
Contributor

@goblirsc goblirsc commented Mar 17, 2026

PR description:

This PR adds a missing buffer flush call to MillePedeAlignmentAlgorithm::endLuminosityBlock.
The missing call was resulting in the final dimuon track of a job to be truncated at the time the alignment FileBlob was written into the ALCARECO output, resulting in corrupted alignment data and failing high-granularity PCL alignment jobs.
Longer description below.

PR validation:

[ ✔️ ] Confirm that local re-running of PCL job with the change results in a succesful alignment
[ ✔️ ] Unit tests pass
[ ✔️ ] runTheMatrix WF 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002 (modulo 5 x "input file not found" - known and unrelated to the PR)

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

This PR is only intended as a bugfix for 16_0_X to fix the ongoing PCL high-granularity alignment involving dimuons.
In master (already merged in #49963), a new writing pattern for Mille was adopted that also fixes this issue.

CC @lpnair

Detailed description of bug:

  • The MillePedeAlignmentAlgorithm has two separate Mille write buffers writing into the same file:
    • theMille: Used when a reference trajectory exists (MinBias)
    • theBinary: Used to talk to the GBL interface inside the algorithm, used for dimuon pairs
    • both write into the same file, but need separate calls to flush their buffers to disk.
  • the sequence of calls at the end of a PCL job is :
    1. MillePedeAlignmentAlgorithm::endLuminosityBlock
      • this flushes only theMille, but not theBinary
    2. MillePedeFileConverter::endLuminosityBlockProduce
      • this reads the Mille binary file from disk and writes a 1:1 carbon copy as a FileBlob into the output ALCARECO file.
    3. MillePedeAlignmentAlgorithm::terminate
      • this resets and closes both theMille and theBinary - leading to any remaining buffers being flushed to disk.
  • the bug: We are missing a final flush of the buffer from theBinary before producing the FileBlob
    • in our test example, the file at the time of file blob extraction is 46kB, but another 5kB (half of the final record) are written to the file after this call (in terminate) - and thus missed in the FileBlob.

Consequence:

  • The final record of the dimuon stream is truncated.
  • The corresponding binary is corrupt
    • from Mille's point of view - it is a valid file, but no complete Mille record can be decoded from it.

Proposed Fix in this PR: Make sure to flush the theBinary buffer to disk in MillePedeAlignmentAlgorithm::endLuminosityBlock with a minimal change.

@cmsbuild
Copy link
Copy Markdown
Contributor

Pull request #50447 was updated.

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Mar 17, 2026

cms-bot internal usage

@cmsbuild
Copy link
Copy Markdown
Contributor

Pull request #50447 was updated.

1 similar comment
@cmsbuild
Copy link
Copy Markdown
Contributor

Pull request #50447 was updated.

@goblirsc
Copy link
Copy Markdown
Contributor Author

Not directly relevant for the review of this PR, but in case any experts are reading this: Can you think of a reason why in 15_1_0, this issue did not result in truncated Mille binary files, in spite of (at first glance) identical logic and call ordering?

Are there any known changes between the two release families that could explain why this has become an issue now? @lpnair confirmed that even "old" workflows that used to run in 15_1_0 now produce corrupted files in 16_0_X, so this really seems to be release-specific and not tied to the particular 2026 inputs.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

@cms-sw/core-l2 do you have any hint on #50447 (comment) (see also PR description). It looks like something changed in the way calls are scheduled.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

@goblirsc please squash the commits to one.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

@goblirsc

WF 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002 (modulo 5 x "input file not found" - known and unrelated to the PR)

as reported elsewhere, if you run with the --ibeos option, the input file not found should be solved.

@goblirsc
Copy link
Copy Markdown
Contributor Author

@goblirsc

WF 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002 (modulo 5 x "input file not found" - known and unrelated to the PR)

as reported elsewhere, if you run with the --ibeos option, the input file not found should be solved.

Hi, I ran runTheMatrix.py -l 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002.5 -i all --ibeos and this ran into the issue. Is the position of --ibeos important? I did also check for a valid VOMS proxy and kerberos token.

@goblirsc goblirsc force-pushed the MG_fixGBLflush_16_0_X branch from ccf2198 to 1be9f9d Compare March 17, 2026 16:37
@cmsbuild
Copy link
Copy Markdown
Contributor

Pull request #50447 was updated.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

test parameters:

  • workflows = 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002.5

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

@cmsbuild, please test

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

I confirm that a previously failing setup:

#!/bin/bash -ex

dasgoclient --limit 0 --query 'file dataset=/HLTMonitor/Run2026A-Express-v1/FEVTHLTALL run=401691' | sort -u > input_files.txt
echo '{"401691": [[36, 71]]}' > step1_lumiRanges.txt

cmsDriver.py ReAlCaHLT \
  -s ALCA:TkAlHLTTracks+TkAlHLTTracksZMuMu+PromptCalibProdSiPixelAliHLTHGC \
  --conditions 160X_dataRun3_Express_v1 \
  --scenario pp \
  --data \
  --era Run3_2026 \
  --datatier ALCARECO \
  --eventcontent ALCARECO \
  --process RECO \
  --processName ReAlCa \
  --filein filelist:input_files.txt \
  --lumiToProcess step1_lumiRanges.txt \
  -n 1000 >& ReAlCa.log

cmsDriver.py ALCAHARVDSIPIXELALIHLTHGCOMBINED \
  -s ALCAHARVEST:SiPixelAliHLTHGCombined \
  --conditions 160X_dataRun3_Express_v1 \
  --scenario pp \
  --era Run3_2026 \
  --data \
  -n -1 \
  --filein file:PromptCalibProdSiPixelAliHLTHGC.root \
  --customise Alignment/CommonAlignmentProducer/customizeLSNumberFilterForRelVals.lowerHitsPerStructure >& Harvesting.log

now produces a payload, despite in the log I find:

%MSG-e Alignment:   AlignmentProducerAsAnalyzer:SiPixelAliPedeAlignmentProducerHLTHGCombined@endProcessBlock  PedeSteerer::runPede() 17-Mar-2026 17:38:11 CET post-events
Failed to call Pede
%MSG

@makortel
Copy link
Copy Markdown
Contributor

makortel commented Mar 17, 2026

@cms-sw/core-l2 do you have any hint on #50447 (comment) (see also PR description). It looks like something changed in the way calls are scheduled.

As a first comment, in absence of data dependencies between modules the order of modules for which endLuminosityBlock() / endLuminosityBlockProduce() are called is unspecified. What module (C++ class) uses the MillePedeAlignmentAlgorithm in this case?

@goblirsc
Copy link
Copy Markdown
Contributor Author

I confirm that a previously failing setup:

#!/bin/bash -ex

dasgoclient --limit 0 --query 'file dataset=/HLTMonitor/Run2026A-Express-v1/FEVTHLTALL run=401691' | sort -u > input_files.txt
echo '{"401691": [[36, 71]]}' > step1_lumiRanges.txt

cmsDriver.py ReAlCaHLT \
  -s ALCA:TkAlHLTTracks+TkAlHLTTracksZMuMu+PromptCalibProdSiPixelAliHLTHGC \
  --conditions 160X_dataRun3_Express_v1 \
  --scenario pp \
  --data \
  --era Run3_2026 \
  --datatier ALCARECO \
  --eventcontent ALCARECO \
  --process RECO \
  --processName ReAlCa \
  --filein filelist:input_files.txt \
  --lumiToProcess step1_lumiRanges.txt \
  -n 1000 >& ReAlCa.log

cmsDriver.py ALCAHARVDSIPIXELALIHLTHGCOMBINED \
  -s ALCAHARVEST:SiPixelAliHLTHGCombined \
  --conditions 160X_dataRun3_Express_v1 \
  --scenario pp \
  --era Run3_2026 \
  --data \
  -n -1 \
  --filein file:PromptCalibProdSiPixelAliHLTHGC.root \
  --customise Alignment/CommonAlignmentProducer/customizeLSNumberFilterForRelVals.lowerHitsPerStructure >& Harvesting.log

now produces a payload, despite in the log I find:

%MSG-e Alignment:   AlignmentProducerAsAnalyzer:SiPixelAliPedeAlignmentProducerHLTHGCombined@endProcessBlock  PedeSteerer::runPede() 17-Mar-2026 17:38:11 CET post-events
Failed to call Pede
%MSG

Hi, I just tried to reproduce this, but for me, the pede step succeeds with

%MSG-e AlignmentIORootBase:   AlignmentProducerAsAnalyzer:SiPixelAliPedeAlignmentProducerHLTHGCombined@endProcessBlock  17-Mar-2026 17:46:50 CET post-events
Iteration 0 invalid or already exists for tree AlignablesOrgPos
(...)

and several warnings for inactive alignables, which however makes sense with just 1000 events

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

Hi, I just tried to reproduce this, but for me, the pede step succeeds with

I have run merging locally #50420

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

Hi, I ran runTheMatrix.py -l 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002.5 -i all --ibeos and this ran into the issue. Is the position of --ibeos important? I did also check for a valid VOMS proxy and kerberos token.

I think that's precisely the problem (having a valid VOMS proxy and try to run the --ibeos).
If I set the env variables by hand:

setenv CMS_PATH "/cvmfs/cms-ib.cern.ch"
setenv SITECONFIG_PATH "/cvmfs/cms-ib.cern.ch/SITECONF/local"

things run correctly for me -- I encourage you to use this in the future.

@goblirsc
Copy link
Copy Markdown
Contributor Author

#50420

Ah, great! I didn't merge that one in my test. Then yes, the message is expected: The exit code from this run is

    1   Ended with warnings (bad measurements)

So it should indeed warn the user about a non-nominal exit code.

For a follow-up (though maybe on master instead of the T0 release), we can consider making the log entry less alarming for warnings that do not have error status.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

So it should indeed warn the user about a non-nominal exit code.

in my opinion this is very misleading. A fix is in order (to be backported).

@goblirsc
Copy link
Copy Markdown
Contributor Author

@cms-sw/core-l2 do you have any hint on #50447 (comment) (see also PR description). It looks like something changed in the way calls are scheduled.

As a first comment, in absence of data dependencies between modules the order of modules for which endLuminosityBlock() / endLuminosityBlockProduce() are called is unspecified. What module (C++ class) uses the MillePedeAlignmentAlgorithm in this case?

Hi, in this case it seems to be an AlignmentProducerAsAnalyzer via ALCARECOPromptCalibProdSiPixelAli_cff.py.

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

type bug-fix

@mmusich
Copy link
Copy Markdown
Contributor

mmusich commented Mar 17, 2026

urgent

  • this must go into the next production release

@goblirsc goblirsc marked this pull request as ready for review March 17, 2026 17:59
@cmsbuild
Copy link
Copy Markdown
Contributor

A new Pull Request was created by @goblirsc for CMSSW_16_0_X.

It involves the following packages:

  • Alignment/MillePedeAlignmentAlgorithm (alca)

@Alejandro1400, @JanChyczynski, @arunhep, @atpathak, @perrotta can you please review it and eventually sign? Thanks.
@mmusich, @rsreds, @tlampen, @tocheng, @yuanchao this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Copy link
Copy Markdown
Contributor

+1

Size: This PR adds an extra 36KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6441b1/52038/summary.html
COMMIT: 1be9f9d
CMSSW: CMSSW_16_0_X_2026-03-17-1100/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50447/52038/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 7 lines to the logs
  • Reco comparison results: 12 differences found in the comparisons
  • DQMHistoTests: Total files compared: 59
  • DQMHistoTests: Total histograms compared: 4296185
  • DQMHistoTests: Total failures: 62
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4296103
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 58 files compared)
  • Checked 256 log files, 203 edm output root files, 59 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Copy Markdown
Contributor

@cms-sw/core-l2 do you have any hint on #50447 (comment) (see also PR description). It looks like something changed in the way calls are scheduled.

As a first comment, in absence of data dependencies between modules the order of modules for which endLuminosityBlock() / endLuminosityBlockProduce() are called is unspecified. What module (C++ class) uses the MillePedeAlignmentAlgorithm in this case?

Hi, in this case it seems to be an AlignmentProducerAsAnalyzer via ALCARECOPromptCalibProdSiPixelAli_cff.py.

Thanks. On a quick look I don't see any possible data dependency relationship between MillePedeFileConverter::endLuminosityBlockProduce() and AlignmentProducerAsAnalyzer::endLuminosityBlock(), so they can get run on either order.

If they need to be run in in a specific order, the module whose endLumi needs to be run first should produce something to the LuminosityBlock in its endLumi, that the module to be run second should then consume in its endLumi.

On another quick look I did not see any framework changes that should alter the order of modules in endLumi transition. We did update TBB, that might have an effect or not.

@perrotta
Copy link
Copy Markdown
Contributor

+alca

@cmsbuild
Copy link
Copy Markdown
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_16_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_16_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @ftenchini, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Copy link
Copy Markdown
Contributor

+1

@cmsbuild cmsbuild merged commit 39d6329 into cms-sw:CMSSW_16_0_X Mar 18, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants