Skip to content

operator: add flag to skip decommission on scale down#61

Open
eric-higgins-ai wants to merge 1 commit into
NVIDIA:mainfrom
eric-higgins-ai:skip-decommission-flag
Open

operator: add flag to skip decommission on scale down#61
eric-higgins-ai wants to merge 1 commit into
NVIDIA:mainfrom
eric-higgins-ai:skip-decommission-flag

Conversation

@eric-higgins-ai

@eric-higgins-ai eric-higgins-ai commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

We run into issues currently in the following case:

  • Nodes A and B are running target pods 0 and 1, respectively
  • Node A has a hardware fault, so we drain and terminate it for repair. This leaves target pod 0 in a pending state since no node can host it.
  • We scale down the AIStore cluster by 1 by setting spec.size in the CRD
  • The operator decommissions target pod 1 and then deletes it
  • Target pod 0 gets scheduled on node B, which was previously hosting target pod 1

This decommissioning causes some issues. If everything works well then it needlessly deletes all the data on node B. Target pod 0 is going to be immediately scheduled on node B and could have continued serving the files cached there. It's also possible for the decommission to fail partway through. This deletes some bucket directories and leaves some on the node. After target pod 0 is scheduled on the node, when trying to fetch a file from one of the deleted buckets, the request will fail with a 500 response code.

To address both cases, this adds a flag to skip decommissioning when scaling down either the proxy or target statefulsets.

Comment thread operator/api/v1beta1/aistore_types.go Outdated
@aaronnw

aaronnw commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

If everything works well then it needlessly deletes all the data on node B

Today we make a call to AIS API before deleting pods as part of target scale-down.
_, err = apiClient.DecommissionNode(&aisapc.ActValRmNode{DaemonID: node.ID(), RmUserData: true})

The decommission call removes the AIS node from the cluster map, which I think we need regardless. But we could likely pipe in this new option to RmUserData.

Being able to schedule target-0 onto a node that hosted target-1 is likely going to cause some problems today (although it would be nice). One issue is that we use the pod name in the cluster map alongside potentially node hostIP or hostname, depending on PubNetDNSMode. That will conflict if you just reschedule the pod. We also use the pod ordinal to bind the PVs and state storage today (though hopefully we can remove that restriction).

IF you have the storage bindings set up to avoid these issues, then ideally if we gate data removal but still decommission, "target-0" will reschedule and re-join the cluster map fresh with the new scheduled host info.

Signed-off-by: eric-higgins-ai <erichiggins@applied.co>
@eric-higgins-ai eric-higgins-ai force-pushed the skip-decommission-flag branch from 5a3b53b to 8f23888 Compare June 24, 2026 18:27
@eric-higgins-ai

Copy link
Copy Markdown
Contributor Author

Thanks for the fast response @aaronnw ! I replaced the flag with a rmUserDataOnScaleDown flag that only exists on the target spec.

If my understanding is correct, the cluster map shouldn't be a problem. In my above case, target pod 0 is removed from the cluster map when its node is terminated (because it fails heartbeats from the primary proxy), and target pod 1 is removed on scale down. Target pod 0 will be re-added when it rejoins the cluster from node B. My understanding is also that the daemon ID is stored in the data volumes, so target pod 0 will inherit target pod 1's daemon ID and the HRW hashing won't change.

@eric-higgins-ai

Copy link
Copy Markdown
Contributor Author

We use Oracle Cloud Block Volumes for state storage, which should generally be able to attach to the new node. There are some issues, which I mentioned in this PR, but I was planning to address those by switching to emptyDir for state storage as mentioned there.

@alex-aizman

Copy link
Copy Markdown
Member

By definition, decommission is explicitly a permanent, destructive operation meant to wipe metadata and user data:

If the goal during a K8s scale-down is to preserve the data on the disk/PV for potential reuse, that's actually the exact use case for maintenance or shutdown mode.

​Instead of introducing a flag to alter decommission behavior, it might be cleaner to have the operator leverage the maintenance workflow for scale-downs. This keeps the core abstractions clean and prevents blurring the lines on what a decommission does.

@aaronnw

aaronnw commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

The reason we'd need to decommission is that if target-0 is rescheduled onto a node with target-1 data, that means intra_data_net and intra_control_net will change for that AIS target daemon entry in the cluster map.

There's no way to update those for an existing AIS cluster map entry, so we need to decommission what was target-1 before we reschedule target-0 onto the old k8s node and let it re-register.

@eric-higgins-ai

Copy link
Copy Markdown
Contributor Author

One other case I forgot about is rebalancing (we have rebalance.enabled=false right now so I didn't think of it right away). A rebalance would be excessive in the case I mentioned above, because the data cached on node B will only be inaccessible for a short time.

I'm thinking we can rename the rmUserDataOnScaleDown flag to something like retainUserDataOnScaleDown. If the flag is true then we set SkipRebalance=true and RmUserData=false when decommissioning. Does that make sense to y'all?

@alex-aizman

Copy link
Copy Markdown
Member

There’s no way to update those

Yes there's a way:

https://github.com/NVIDIA/aistore/blob/main/ais/pclupost.go#L498

This is a very old code. Aaron, take another look.

@eric-higgins-ai

Copy link
Copy Markdown
Contributor Author

Is this true in general?

If the goal during a K8s scale-down is to preserve the data on the disk/PV for potential reuse

I have to imagine there are cases where someone would actually want to decommission a node when scaling down

@alex-aizman

Copy link
Copy Markdown
Member

absolutely!

that's why it is important to keep the terminology straight: decommission is decommission, while maintenance is... maintenance ;)

@eric-higgins-ai

Copy link
Copy Markdown
Contributor Author

Ok so if I understand correctly then the preferred solution would be to have a flag (called like softScaleDown or something) control the scale down behavior. If the flag is true then we call StartMaintenance with SkipRebalance before scaling down, and if the flag is false then we keep the current behavior. The default value will be false.

Is that what you're thinking as well?

@aaronnw

aaronnw commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

If we switch to maintenance mode rather than decommission with optional data removal, that brings some additional complexity in taking the node out of maintenance mode.

From what I'm reading, without a fresh join, target-0 after reschedule would come up with the data on the node that hosted target-1, but would still be in maintenance even after re-registering with new intraCluster and intraData urls.

The operator does not reset this because from a K8s perspective, all we are doing is scaling down the statefulset. Which is partially why "decommission" makes a bit more sense here to me. Because our usual flow is scale down == full decommission and here we are just modifying that to not delete the data on disk first. Very similar to cleanupData option -- and I wonder if we could just re-use that.

@aaronnw

aaronnw commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

The operator also has no idea when doing a scale down if any of the other existing pending target pods will be able to reschedule onto the node we are making available by scaling down.

So without a whole bunch of logic trying to figure that out, in most scenarios we really DO want to do a decommission.

@aaronnw

aaronnw commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Because the operator does not know when scaling down if any other pod will suddenly be able to schedule onto the node, I would say rmUserDataOnScaleDown is ok as written.

The fact that another pod will be able to schedule and assume the prior pod's AIS node identity in the smap is not something we know ahead of time.

@alex-aizman

Copy link
Copy Markdown
Member

I still don't understand how Kubernetes can prescribe us to do destructive decommission when we definitely want something else.

The Operator is in charge. There's the API:

The Operator could put a target in maintenance mode once before scale-down, and later clear maintenance after the pod/daemon comes back and re-registers with updated intra-cluster / intra-data URLs.

There's maybe a razor-thin corner case that'll require work on our side. In particular, AIS node ID versus K8s pod name/ID, etc. But one thing is painfully clear: we should not be scaling-down via decommission.

@aaronnw

aaronnw commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

I still don't understand how Kubernetes can prescribe us to do destructive decommission when we definitely want something else.

It's a two step thing. If Eric sets spec.size in the CR such that we want a smaller statefulset then that's what we'll do, which follows our usual scale down process. The user is directly telling us they want to decommission a node.

In this scenario, the pod that was pending is able to assume the node made available by scale down. AFTER the statefulset is already modified. But in a lot of deployments that's not even possible. Trying to scale down while you have pending pods is already an edge case and one to avoid if possible.

The Operator could put a target in maintenance mode once before scale-down, and later clear maintenance after the pod/daemon comes back and re-registers with updated intra-cluster / intra-data URLs.

I am not sure this is possible for a couple reasons.

  1. We can't tell if the scale-down is real or this weird temporary edge case. If it's real, we need a real decommission. (edit: could be resolved if we expect the user to reset the option after)
  2. The operator reconciliation is stateless across different reconciliation loops. So I'm not sure how we'd be able to both predict that the target will be replaced then also be able to differentiate between a target in maintenance vs one that no longer should be. (edit: potentially non-issue if new pod comes up with clean flags + no maintenance. We don't reset this flag today)

In either scenario I feel like to work with a simple flag, it needs to be a temporary option that should be toggled off in spec once done, or we risk data loss and stale smap entries from not properly removing scaled out nodes.:

  • If we do a maintenance call, we need to return to decommission scale-down after things are untangled.
  • If we do a decommission call but skip rebalance/deletion, we'd need to restore that behavior.

If it's not intended as a manual admin action, we might need something more advanced...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants