Fix: Add missing buffer flush for PCL to MillePedeAlignmentAlgorithm by goblirsc · Pull Request #50447 · cms-sw/cmssw

goblirsc · 2026-03-17T15:39:08Z

PR description:

This PR adds a missing buffer flush call to MillePedeAlignmentAlgorithm::endLuminosityBlock.
The missing call was resulting in the final dimuon track of a job to be truncated at the time the alignment FileBlob was written into the ALCARECO output, resulting in corrupted alignment data and failing high-granularity PCL alignment jobs.
Longer description below.

PR validation:

[ ✔️ ] Confirm that local re-running of PCL job with the change results in a succesful alignment
[ ✔️ ] Unit tests pass
[ ✔️ ] runTheMatrix WF 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002 (modulo 5 x "input file not found" - known and unrelated to the PR)

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

This PR is only intended as a bugfix for 16_0_X to fix the ongoing PCL high-granularity alignment involving dimuons.
In master (already merged in #49963), a new writing pattern for Mille was adopted that also fixes this issue.

CC @lpnair

Detailed description of bug:

The MillePedeAlignmentAlgorithm has two separate Mille write buffers writing into the same file:
- theMille: Used when a reference trajectory exists (MinBias)
- theBinary: Used to talk to the GBL interface inside the algorithm, used for dimuon pairs
- both write into the same file, but need separate calls to flush their buffers to disk.
the sequence of calls at the end of a PCL job is :
1. MillePedeAlignmentAlgorithm::endLuminosityBlock
  - this flushes only theMille, but not theBinary
2. MillePedeFileConverter::endLuminosityBlockProduce
  - this reads the Mille binary file from disk and writes a 1:1 carbon copy as a FileBlob into the output ALCARECO file.
3. MillePedeAlignmentAlgorithm::terminate
  - this resets and closes both theMille and theBinary - leading to any remaining buffers being flushed to disk.
the bug: We are missing a final flush of the buffer from theBinary before producing the FileBlob
- in our test example, the file at the time of file blob extraction is 46kB, but another 5kB (half of the final record) are written to the file after this call (in terminate) - and thus missed in the FileBlob.

Consequence:

The final record of the dimuon stream is truncated.
The corresponding binary is corrupt
- from Mille's point of view - it is a valid file, but no complete Mille record can be decoded from it.

Proposed Fix in this PR: Make sure to flush the theBinary buffer to disk in MillePedeAlignmentAlgorithm::endLuminosityBlock with a minimal change.

cmsbuild · 2026-03-17T15:39:40Z

Pull request #50447 was updated.

cmsbuild · 2026-03-17T15:39:41Z

cms-bot internal usage

cmsbuild · 2026-03-17T15:48:46Z

Pull request #50447 was updated.

cmsbuild · 2026-03-17T15:53:43Z

Pull request #50447 was updated.

goblirsc · 2026-03-17T16:28:43Z

Not directly relevant for the review of this PR, but in case any experts are reading this: Can you think of a reason why in 15_1_0, this issue did not result in truncated Mille binary files, in spite of (at first glance) identical logic and call ordering?

Are there any known changes between the two release families that could explain why this has become an issue now? @lpnair confirmed that even "old" workflows that used to run in 15_1_0 now produce corrupted files in 16_0_X, so this really seems to be release-specific and not tied to the particular 2026 inputs.

mmusich · 2026-03-17T16:31:25Z

@cms-sw/core-l2 do you have any hint on #50447 (comment) (see also PR description). It looks like something changed in the way calls are scheduled.

mmusich · 2026-03-17T16:31:36Z

@goblirsc please squash the commits to one.

mmusich · 2026-03-17T16:32:50Z

@goblirsc

WF 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002 (modulo 5 x "input file not found" - known and unrelated to the PR)

as reported elsewhere, if you run with the --ibeos option, the input file not found should be solved.

goblirsc · 2026-03-17T16:35:51Z

@goblirsc

WF 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002 (modulo 5 x "input file not found" - known and unrelated to the PR)

as reported elsewhere, if you run with the --ibeos option, the input file not found should be solved.

Hi, I ran runTheMatrix.py -l 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002.5 -i all --ibeos and this ran into the issue. Is the position of --ibeos important? I did also check for a valid VOMS proxy and kerberos token.

cmsbuild · 2026-03-17T16:38:02Z

Pull request #50447 was updated.

mmusich · 2026-03-17T16:38:18Z

test parameters:

workflows = 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002.5

mmusich · 2026-03-17T16:38:28Z

@cmsbuild, please test

mmusich · 2026-03-17T16:39:43Z

I confirm that a previously failing setup:

#!/bin/bash -ex

dasgoclient --limit 0 --query 'file dataset=/HLTMonitor/Run2026A-Express-v1/FEVTHLTALL run=401691' | sort -u > input_files.txt
echo '{"401691": [[36, 71]]}' > step1_lumiRanges.txt

cmsDriver.py ReAlCaHLT \
  -s ALCA:TkAlHLTTracks+TkAlHLTTracksZMuMu+PromptCalibProdSiPixelAliHLTHGC \
  --conditions 160X_dataRun3_Express_v1 \
  --scenario pp \
  --data \
  --era Run3_2026 \
  --datatier ALCARECO \
  --eventcontent ALCARECO \
  --process RECO \
  --processName ReAlCa \
  --filein filelist:input_files.txt \
  --lumiToProcess step1_lumiRanges.txt \
  -n 1000 >& ReAlCa.log

cmsDriver.py ALCAHARVDSIPIXELALIHLTHGCOMBINED \
  -s ALCAHARVEST:SiPixelAliHLTHGCombined \
  --conditions 160X_dataRun3_Express_v1 \
  --scenario pp \
  --era Run3_2026 \
  --data \
  -n -1 \
  --filein file:PromptCalibProdSiPixelAliHLTHGC.root \
  --customise Alignment/CommonAlignmentProducer/customizeLSNumberFilterForRelVals.lowerHitsPerStructure >& Harvesting.log

now produces a payload, despite in the log I find:

%MSG-e Alignment:   AlignmentProducerAsAnalyzer:SiPixelAliPedeAlignmentProducerHLTHGCombined@endProcessBlock  PedeSteerer::runPede() 17-Mar-2026 17:38:11 CET post-events
Failed to call Pede
%MSG

makortel · 2026-03-17T16:49:57Z

@cms-sw/core-l2 do you have any hint on #50447 (comment) (see also PR description). It looks like something changed in the way calls are scheduled.

As a first comment, in absence of data dependencies between modules the order of modules for which endLuminosityBlock() / endLuminosityBlockProduce() are called is unspecified. What module (C++ class) uses the MillePedeAlignmentAlgorithm in this case?

goblirsc · 2026-03-17T16:50:37Z

I confirm that a previously failing setup:

#!/bin/bash -ex

dasgoclient --limit 0 --query 'file dataset=/HLTMonitor/Run2026A-Express-v1/FEVTHLTALL run=401691' | sort -u > input_files.txt
echo '{"401691": [[36, 71]]}' > step1_lumiRanges.txt

cmsDriver.py ReAlCaHLT \
  -s ALCA:TkAlHLTTracks+TkAlHLTTracksZMuMu+PromptCalibProdSiPixelAliHLTHGC \
  --conditions 160X_dataRun3_Express_v1 \
  --scenario pp \
  --data \
  --era Run3_2026 \
  --datatier ALCARECO \
  --eventcontent ALCARECO \
  --process RECO \
  --processName ReAlCa \
  --filein filelist:input_files.txt \
  --lumiToProcess step1_lumiRanges.txt \
  -n 1000 >& ReAlCa.log

cmsDriver.py ALCAHARVDSIPIXELALIHLTHGCOMBINED \
  -s ALCAHARVEST:SiPixelAliHLTHGCombined \
  --conditions 160X_dataRun3_Express_v1 \
  --scenario pp \
  --era Run3_2026 \
  --data \
  -n -1 \
  --filein file:PromptCalibProdSiPixelAliHLTHGC.root \
  --customise Alignment/CommonAlignmentProducer/customizeLSNumberFilterForRelVals.lowerHitsPerStructure >& Harvesting.log

now produces a payload, despite in the log I find:

%MSG-e Alignment:   AlignmentProducerAsAnalyzer:SiPixelAliPedeAlignmentProducerHLTHGCombined@endProcessBlock  PedeSteerer::runPede() 17-Mar-2026 17:38:11 CET post-events
Failed to call Pede
%MSG

Hi, I just tried to reproduce this, but for me, the pede step succeeds with

%MSG-e AlignmentIORootBase:   AlignmentProducerAsAnalyzer:SiPixelAliPedeAlignmentProducerHLTHGCombined@endProcessBlock  17-Mar-2026 17:46:50 CET post-events
Iteration 0 invalid or already exists for tree AlignablesOrgPos
(...)

and several warnings for inactive alignables, which however makes sense with just 1000 events

mmusich · 2026-03-17T16:51:49Z

Hi, I just tried to reproduce this, but for me, the pede step succeeds with

I have run merging locally #50420

mmusich · 2026-03-17T16:53:54Z

Hi, I ran runTheMatrix.py -l 1000,1001,1001.2,1001.3,1001.4,1002.3,1002.4,1002.5 -i all --ibeos and this ran into the issue. Is the position of --ibeos important? I did also check for a valid VOMS proxy and kerberos token.

I think that's precisely the problem (having a valid VOMS proxy and try to run the --ibeos).
If I set the env variables by hand:

setenv CMS_PATH "/cvmfs/cms-ib.cern.ch"
setenv SITECONFIG_PATH "/cvmfs/cms-ib.cern.ch/SITECONF/local"

things run correctly for me -- I encourage you to use this in the future.

goblirsc · 2026-03-17T16:56:28Z

#50420

Ah, great! I didn't merge that one in my test. Then yes, the message is expected: The exit code from this run is

    1   Ended with warnings (bad measurements)

So it should indeed warn the user about a non-nominal exit code.

For a follow-up (though maybe on master instead of the T0 release), we can consider making the log entry less alarming for warnings that do not have error status.

mmusich · 2026-03-17T16:57:40Z

So it should indeed warn the user about a non-nominal exit code.

in my opinion this is very misleading. A fix is in order (to be backported).

goblirsc · 2026-03-17T17:19:37Z

@cms-sw/core-l2 do you have any hint on #50447 (comment) (see also PR description). It looks like something changed in the way calls are scheduled.

As a first comment, in absence of data dependencies between modules the order of modules for which endLuminosityBlock() / endLuminosityBlockProduce() are called is unspecified. What module (C++ class) uses the MillePedeAlignmentAlgorithm in this case?

Hi, in this case it seems to be an AlignmentProducerAsAnalyzer via ALCARECOPromptCalibProdSiPixelAli_cff.py.

mmusich · 2026-03-17T17:48:31Z

type bug-fix

mmusich · 2026-03-17T17:48:46Z

urgent

this must go into the next production release

cmsbuild · 2026-03-17T18:15:34Z

A new Pull Request was created by @goblirsc for CMSSW_16_0_X.

It involves the following packages:

Alignment/MillePedeAlignmentAlgorithm (alca)

@Alejandro1400, @JanChyczynski, @arunhep, @atpathak, @perrotta can you please review it and eventually sign? Thanks.
@mmusich, @rsreds, @tlampen, @tocheng, @yuanchao this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

cmsbuild · 2026-03-17T18:44:50Z

+1

Size: This PR adds an extra 36KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6441b1/52038/summary.html
COMMIT: 1be9f9d
CMSSW: CMSSW_16_0_X_2026-03-17-1100/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50447/52038/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially added 7 lines to the logs
Reco comparison results: 12 differences found in the comparisons
DQMHistoTests: Total files compared: 59
DQMHistoTests: Total histograms compared: 4296185
DQMHistoTests: Total failures: 62
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4296103
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 58 files compared)
Checked 256 log files, 203 edm output root files, 59 DQM output files
TriggerResults: no differences found

makortel · 2026-03-18T08:20:54Z

@cms-sw/core-l2 do you have any hint on #50447 (comment) (see also PR description). It looks like something changed in the way calls are scheduled.

As a first comment, in absence of data dependencies between modules the order of modules for which endLuminosityBlock() / endLuminosityBlockProduce() are called is unspecified. What module (C++ class) uses the MillePedeAlignmentAlgorithm in this case?

Hi, in this case it seems to be an AlignmentProducerAsAnalyzer via ALCARECOPromptCalibProdSiPixelAli_cff.py.

Thanks. On a quick look I don't see any possible data dependency relationship between MillePedeFileConverter::endLuminosityBlockProduce() and AlignmentProducerAsAnalyzer::endLuminosityBlock(), so they can get run on either order.

If they need to be run in in a specific order, the module whose endLumi needs to be run first should produce something to the LuminosityBlock in its endLumi, that the module to be run second should then consume in its endLumi.

On another quick look I did not see any framework changes that should alter the order of modules in endLumi transition. We did update TBB, that might have an effect or not.

perrotta · 2026-03-18T13:11:53Z

+alca

cmsbuild · 2026-03-18T13:12:17Z

This pull request is fully signed and it will be integrated in one of the next CMSSW_16_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_16_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @ftenchini, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

mandrenguyen · 2026-03-18T15:06:20Z

+1

cmsbuild added this to the CMSSW_16_0_X milestone Mar 17, 2026

cmsbuild added alca-pending pending-signatures tests-pending orp-pending trk labels Mar 17, 2026

fix missing GBL flush in MillePedeAlignmentAlgorithm::endLuminosityBlock

1be9f9d

goblirsc force-pushed the MG_fixGBLflush_16_0_X branch from ccf2198 to 1be9f9d Compare March 17, 2026 16:37

cmsbuild added tests-started and removed tests-pending labels Mar 17, 2026

cmsbuild added bug-fix urgent labels Mar 17, 2026

goblirsc marked this pull request as ready for review March 17, 2026 17:59

cmsbuild added tests-approved and removed tests-started labels Mar 17, 2026

cmsbuild added alca-approved fully-signed and removed alca-pending pending-signatures labels Mar 18, 2026

cmsbuild added orp-approved and removed orp-pending labels Mar 18, 2026

cmsbuild merged commit 39d6329 into cms-sw:CMSSW_16_0_X Mar 18, 2026
9 checks passed

Conversation

goblirsc commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR description:

PR validation:

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Detailed description of bug:

Uh oh!

cmsbuild commented Mar 17, 2026

Uh oh!

cmsbuild commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cmsbuild commented Mar 17, 2026

Uh oh!

cmsbuild commented Mar 17, 2026

Uh oh!

goblirsc commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

goblirsc commented Mar 17, 2026

Uh oh!

cmsbuild commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

makortel commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

goblirsc commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

goblirsc commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

goblirsc commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

mmusich commented Mar 17, 2026

Uh oh!

cmsbuild commented Mar 17, 2026

Uh oh!

cmsbuild commented Mar 17, 2026

Comparison Summary

Uh oh!

makortel commented Mar 18, 2026

Uh oh!

perrotta commented Mar 18, 2026

Uh oh!

cmsbuild commented Mar 18, 2026

Uh oh!

mandrenguyen commented Mar 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

goblirsc commented Mar 17, 2026 •

edited

Loading

cmsbuild commented Mar 17, 2026 •

edited

Loading

mmusich commented Mar 17, 2026 •

edited

Loading

makortel commented Mar 17, 2026 •

edited

Loading