[1/n] [reconfigurator-cli] in cmds-mupdate-update-flow, apply sled agent configs#10408
Conversation
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
jgallagher
left a comment
There was a problem hiding this comment.
With apologies - I think these comments all point to possible other problems not related to the generation fix coming in #10409. They can certainly be fixed separately, but wanted to point them out to see if I was misunderstanding them.
| 803bfb63-c246-41db-b0da-d3b87ddfc63d external_dns install-dataset | ||
| ba4994a8-23f9-4b1a-a84f-a08d74591389 crucible_pantry install-dataset | ||
| ID KIND IMAGE_SOURCE | ||
| 0c71b3b2-6ceb-4e8f-b020-b08675e83038 nexus artifact: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 |
There was a problem hiding this comment.
I'm not sure this is wrong for the purposes of the test, but this combo of Nexus having an artifact image source and other zones being sourced from install-dataset shouldn't be possible in practice, right? Either a mupdate has happened (everything is on install-dataset), or we've recovered (everything is an artifact).
IIUC, the test is wanting to show that any zones that were sourced from an artifact are converted to install-dataset after a mupdate. We only have Nexus here as a shortcut to have a shorter test, right? That's probably worth either a comment noting the weird combo, or we could set all the other zones to artifact too for a more typical-looking inventory?
There was a problem hiding this comment.
Yeah this is a bit contrived (especially since this is a manual blueprint edit). I think it would be easiest to just add a comment.
|
|
||
|
|
||
|
|
||
| > # Diff the blueprints. This diff should show: |
There was a problem hiding this comment.
This isn't a change in this PR, but this description seems to have atrophied. Differences between the description and the actual diff:
- sled 1's host phase 2 contents change from
artifacttocurrent contents(this is surprising since it has the mupdate override error?) - sled 1 has a pending MGS update cleared (also surprising to me, both in that we have 2 pending updates and that we'd clear this, although maybe that's normal if there's a mupdate override error?)
- sled 0 and 2 both also have their measurements changed from
unknowntoinstall dataset(seems fine / expected)
There was a problem hiding this comment.
The code doing this is
omicron/nexus/reconfigurator/planning/src/blueprint_editor/sled_editor.rs
Lines 774 to 813 in 7c6ad41
| internal_dns 427ec88f-f467-42fa-9bbb-66a91a36103c artifact: version 1.0.0 expunged ✓ fd00:1122:3344:2::1 | ||
| internal_ntp 6444f8a5-6465-4f0b-a549-1993c113569c install dataset in service fd00:1122:3344:101::3 | ||
| nexus 0c71b3b2-6ceb-4e8f-b020-b08675e83038 install dataset in service fd00:1122:3344:101::4 | ||
| + internal_dns cfc72ddd-fd83-4396-b1ad-d53f69182a09 artifact: version 2.0.0 in service fd00:1122:3344:2::1 |
There was a problem hiding this comment.
Is the image source of this addition correct? This sled has a mupdate override in place, which means we don't know what version it's running; if we're adding a zone, should we be sourcing it from the install dataset?
There was a problem hiding this comment.
Interesting, yeah -- both
omicron/sled-agent/config-reconciler/src/ledger.rs
Lines 455 to 481 in 7c6ad41
Created using spr 1.3.6-beta.1
| > blueprint-blippy latest | ||
| blippy report for blueprint 9a9e6c32-5a84-4020-a159-33dceff18d35: 2 notes | ||
| FATAL note: sled 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c: Nexus zones 466a9f29-62bf-4e63-924a-b9efdb86afec and 0c71b3b2-6ceb-4e8f-b020-b08675e83038 both have generation 1 but different image sources (Artifact { version: Available { version: ArtifactVersion("2.0.0") }, hash: ArtifactHash("e9b7035f41848a987a798c15ac424cc91dd662b1af0920d58d8aa1ebad7467b6") } vs InstallDataset) | ||
| FATAL note: sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6: sled has remove_mupdate_override set (c8fba912-63ae-473a-9115-0495d10fb3bc), but zone cfc72ddd-fd83-4396-b1ad-d53f69182a09 image source is set to Artifact (version version 2.0.0, hash de30657a72b066b8ef1f56351a0a5d4d7000da0a62c4be9b2e949a107ca8a389) |
There was a problem hiding this comment.
Hmm! I think the second note here is #10420; the first note is the thing we were talking about last week, right? Blippy is overly aggressive about Nexus generation mismatches in cases like this where we've mupdated but haven't yet seen all the sleds in inventory?
There was a problem hiding this comment.
Yep! Will update the comment to clarify this
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
In
cmds-mupdate-update-flow.txt, addsled-set omicron-config latestcommands before setting the mupdate override field for these sleds. Without these, the simulated sled-agent's reported reconciled generation lags behind the parent blueprint's per-sled generation, which ends up breaking forthcoming work to detect stale inventory collections while initially setting the "will remove mupdate override" field within blueprints.This test-only PR is separated out into its own commit to try and minimize the diff for the forthcoming functional change.
As a result of this change, within the test, sled 0 now applies a previously-pending zone expungement (
internal_dns 427ec88f). As a result, the planner confirms the expungement and adds a replacement zone (cfc72ddd, artifact 2.0.0). But this is actually a bug -- on a sled with a mupdate override, replacement zones should come from the install dataset. See #10420.