Skip to content

[1/n] [reconfigurator-cli] in cmds-mupdate-update-flow, apply sled agent configs#10408

Merged
sunshowers merged 5 commits intomainfrom
sunshowers/spr/1n-reconfigurator-cli-in-cmds-mupdate-update-flow-apply-sled-agent-configs
May 8, 2026
Merged

[1/n] [reconfigurator-cli] in cmds-mupdate-update-flow, apply sled agent configs#10408
sunshowers merged 5 commits intomainfrom
sunshowers/spr/1n-reconfigurator-cli-in-cmds-mupdate-update-flow-apply-sled-agent-configs

Conversation

@sunshowers
Copy link
Copy Markdown
Contributor

@sunshowers sunshowers commented May 7, 2026

In cmds-mupdate-update-flow.txt, add sled-set omicron-config latest commands before setting the mupdate override field for these sleds. Without these, the simulated sled-agent's reported reconciled generation lags behind the parent blueprint's per-sled generation, which ends up breaking forthcoming work to detect stale inventory collections while initially setting the "will remove mupdate override" field within blueprints.

This test-only PR is separated out into its own commit to try and minimize the diff for the forthcoming functional change.

As a result of this change, within the test, sled 0 now applies a previously-pending zone expungement (internal_dns 427ec88f). As a result, the planner confirms the expungement and adds a replacement zone (cfc72ddd, artifact 2.0.0). But this is actually a bug -- on a sled with a mupdate override, replacement zones should come from the install dataset. See #10420.

sunshowers added 2 commits May 7, 2026 15:09
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Copy link
Copy Markdown
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With apologies - I think these comments all point to possible other problems not related to the generation fix coming in #10409. They can certainly be fixed separately, but wanted to point them out to see if I was misunderstanding them.

803bfb63-c246-41db-b0da-d3b87ddfc63d external_dns install-dataset
ba4994a8-23f9-4b1a-a84f-a08d74591389 crucible_pantry install-dataset
ID KIND IMAGE_SOURCE
0c71b3b2-6ceb-4e8f-b020-b08675e83038 nexus artifact: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is wrong for the purposes of the test, but this combo of Nexus having an artifact image source and other zones being sourced from install-dataset shouldn't be possible in practice, right? Either a mupdate has happened (everything is on install-dataset), or we've recovered (everything is an artifact).

IIUC, the test is wanting to show that any zones that were sourced from an artifact are converted to install-dataset after a mupdate. We only have Nexus here as a shortcut to have a shorter test, right? That's probably worth either a comment noting the weird combo, or we could set all the other zones to artifact too for a more typical-looking inventory?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a bit contrived (especially since this is a manual blueprint edit). I think it would be easiest to just add a comment.




> # Diff the blueprints. This diff should show:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a change in this PR, but this description seems to have atrophied. Differences between the description and the actual diff:

  • sled 1's host phase 2 contents change from artifact to current contents (this is surprising since it has the mupdate override error?)
  • sled 1 has a pending MGS update cleared (also surprising to me, both in that we have 2 pending updates and that we'd clear this, although maybe that's normal if there's a mupdate override error?)
  • sled 0 and 2 both also have their measurements changed from unknown to install dataset (seems fine / expected)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code doing this is

(Err(message), bp_override) => {
// If there's an error obtaining mupdate override information:
//
// * Don't set zone image sources to InstallDataset, because
// Sled Agent *will* run zones out of the install dataset in
// case of an error, but will not run zones with an artifact
// source.
// * Do clear host phase 2 and pending MGS updates to reflect
// the reality that Sled Agent will not run zones in this
// case.
// * Reset measurements to whatever we had from before
let prev_mgs_update = match pending_mgs_update {
Entry::Vacant(_) => None,
Entry::Occupied(entry) => Some(Box::new(entry.remove())),
};
let prev_host_phase_2 = self.set_host_phase_2(
BlueprintHostPhase2DesiredSlots::current_contents(),
);
let prev_measurements =
self.reset_to_parent_blueprint_measurements();
// Also update the cached value inside `noop_sled_info`.
if let NoopConvertSledInfoMut::Ok(mut info) = noop_sled_info {
info.status = NoopConvertSledStatus::Ineligible(
NoopConvertSledIneligibleReason::MupdateOverrideError {
message: message.clone(),
},
);
}
Ok(EnsureMupdateOverrideAction::GetOverrideError {
message: message.clone(),
bp_override,
prev_mgs_update,
prev_host_phase_2,
prev_measurements,
})
. In general, an error in the mupdate override is a pretty exceptional situation and (from my recollection working on this) the bias is towards progressing as little as possible.

internal_dns 427ec88f-f467-42fa-9bbb-66a91a36103c artifact: version 1.0.0 expunged ✓ fd00:1122:3344:2::1
internal_ntp 6444f8a5-6465-4f0b-a549-1993c113569c install dataset in service fd00:1122:3344:101::3
nexus 0c71b3b2-6ceb-4e8f-b020-b08675e83038 install dataset in service fd00:1122:3344:101::4
+ internal_dns cfc72ddd-fd83-4396-b1ad-d53f69182a09 artifact: version 2.0.0 in service fd00:1122:3344:2::1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the image source of this addition correct? This sled has a mupdate override in place, which means we don't know what version it's running; if we're adding a zone, should we be sourcing it from the install dataset?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, yeah -- both

// * If the config has a remove_mupdate_override set, then all zones
// should have their image source set to InstallDataset.
if let Some(mupdate_override_id) = new_config.remove_mupdate_override {
for zone in &new_config.zones {
match zone.image_source {
OmicronZoneImageSource::InstallDataset => {}
OmicronZoneImageSource::Artifact { hash } => {
validation_errors.push(format!(
"remove mupdate override \
set to {mupdate_override_id}, but zone {} \
has image source Artifact with hash {hash}",
zone.id,
));
}
}
}
if new_config.host_phase_2
!= HostPhase2DesiredSlots::current_contents()
{
validation_errors.push(format!(
"remove mupdate override set to {mupdate_override_id}, but \
host phase 2 contents are not set to current: {:#?}",
new_config.host_phase_2,
));
}
}
and blippy will reject this config. I'll file a followup to fix this.

Created using spr 1.3.6-beta.1
> blueprint-blippy latest
blippy report for blueprint 9a9e6c32-5a84-4020-a159-33dceff18d35: 2 notes
FATAL note: sled 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c: Nexus zones 466a9f29-62bf-4e63-924a-b9efdb86afec and 0c71b3b2-6ceb-4e8f-b020-b08675e83038 both have generation 1 but different image sources (Artifact { version: Available { version: ArtifactVersion("2.0.0") }, hash: ArtifactHash("e9b7035f41848a987a798c15ac424cc91dd662b1af0920d58d8aa1ebad7467b6") } vs InstallDataset)
FATAL note: sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6: sled has remove_mupdate_override set (c8fba912-63ae-473a-9115-0495d10fb3bc), but zone cfc72ddd-fd83-4396-b1ad-d53f69182a09 image source is set to Artifact (version version 2.0.0, hash de30657a72b066b8ef1f56351a0a5d4d7000da0a62c4be9b2e949a107ca8a389)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm! I think the second note here is #10420; the first note is the thing we were talking about last week, right? Blippy is overly aggressive about Nexus generation mismatches in cases like this where we've mupdated but haven't yet seen all the sleds in inventory?

Copy link
Copy Markdown
Contributor Author

@sunshowers sunshowers May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Will update the comment to clarify this

sunshowers added 2 commits May 8, 2026 11:25
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) May 8, 2026 18:28
@sunshowers sunshowers disabled auto-merge May 8, 2026 18:29
@sunshowers sunshowers enabled auto-merge (squash) May 8, 2026 18:30
@sunshowers sunshowers merged commit fa779b2 into main May 8, 2026
16 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/1n-reconfigurator-cli-in-cmds-mupdate-update-flow-apply-sled-agent-configs branch May 8, 2026 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants