operator: add flag to skip decommission on scale down #61
eric-higgins-ai wants to merge 1 commit into
Today we make a call to the AIS API before deleting pods as part of target scale-down. The decommission call removes the AIS node from the cluster map, which I think we need regardless. But we could likely pipe in this new option to

Being able to schedule target-0 onto a node that hosted target-1 is likely going to cause some problems today (although it would be nice). One issue is that we use the pod name in the cluster map alongside potentially node hostIP or hostname, depending on

If you have the storage bindings set up to avoid these issues, then ideally, if we gate data removal but still decommission, "target-0" will reschedule and re-join the cluster map fresh with the new scheduled host info. |
Signed-off-by: eric-higgins-ai <erichiggins@applied.co>
5a3b53b to 8f23888
|
Thanks for the fast response @aaronnw! I replaced the flag with a

If my understanding is correct, the cluster map shouldn't be a problem. In my case above, target pod 0 is removed from the cluster map when its node is terminated (because it fails heartbeats from the primary proxy), and target pod 1 is removed on scale down. Target pod 0 will be re-added when it rejoins the cluster from node B. My understanding is also that the daemon ID is stored in the data volumes, so target pod 0 will inherit target pod 1's daemon ID and the HRW hashing won't change. |
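To make the HRW point concrete, here is a minimal rendezvous-hashing sketch. It is not the AIS implementation: the `node` type, the `hrwOwner` function, the FNV hash, and the daemon IDs are all assumptions made for illustration. It only shows that when placement is computed from daemon IDs, changing which pod serves a given daemon ID does not move objects.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// node is a hypothetical, simplified cluster map entry.
type node struct {
	DaemonID string // identity assumed to be persisted on the data volumes
	PodName  string // k8s pod currently serving this daemon ID
}

// hrwOwner returns the daemon ID with the highest hash weight for objName.
// Only DaemonID feeds the hash; PodName is deliberately ignored.
func hrwOwner(nodes []node, objName string) string {
	var (
		best  string
		bestW uint64
	)
	for _, n := range nodes {
		h := fnv.New64a()
		h.Write([]byte(n.DaemonID))
		h.Write([]byte{0}) // separator between ID and object name
		h.Write([]byte(objName))
		w := h.Sum64()
		// Tie-break on ID so the result never depends on slice order.
		if best == "" || w > bestW || (w == bestW && n.DaemonID < best) {
			best, bestW = n.DaemonID, w
		}
	}
	return best
}

func main() {
	before := []node{{"t-aaa", "target-1"}, {"t-bbb", "target-2"}}
	// Same daemon IDs; "t-aaa" is now served by a different pod.
	after := []node{{"t-bbb", "target-2"}, {"t-aaa", "target-0"}}
	obj := "bucket/object-1"
	fmt.Println("same owner:", hrwOwner(before, obj) == hrwOwner(after, obj)) // same owner: true
}
```

Whether the real cluster map behaves this way depends on the daemon ID actually being read back from the volume on rejoin, which is the assumption stated in the comment above.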
|
We use Oracle Cloud Block Volumes for state storage, which should generally be able to attach to the new node. There are some issues, which I mentioned in this PR, but I was planning to address those by switching to emptyDir for state storage as mentioned there. |
|
By definition, decommission is explicitly a permanent, destructive operation meant to wipe metadata and user data:

If the goal during a K8s scale-down is to preserve the data on the disk/PV for potential reuse, that's actually the exact use case for maintenance or shutdown mode. Instead of introducing a flag to alter decommission behavior, it might be cleaner to have the operator leverage the maintenance workflow for scale-downs. This keeps the core abstractions clean and prevents blurring the lines on what a decommission does. |
|
The reason we'd need to decommission is that if

There's no way to update those for an existing AIS cluster map entry, so we need to decommission what was target-1 before we reschedule target-0 onto the old k8s node and let it re-register. |
|
One other case I forgot about is rebalancing (we have

I'm thinking we can rename the |
Yes there's a way: https://github.com/NVIDIA/aistore/blob/main/ais/pclupost.go#L498

This is very old code. Aaron, take another look. |
|
Is this true in general?
I have to imagine there are cases where someone would actually want to decommission a node when scaling down |
|
Absolutely! That's why it is important to keep the terminology straight: decommission is decommission, while maintenance is... maintenance ;) |
|
Ok, so if I understand correctly, then the preferred solution would be to have a flag (called like

Is that what you're thinking as well? |
|
If we switch to maintenance mode rather than decommission with optional data removal, that brings some additional complexity in taking the node out of maintenance mode. From what I'm reading, without a fresh join, target-0 after reschedule would come up with the data on the node that hosted target-1, but would still be in maintenance even after re-registering with new intraCluster and intraData URLs. The operator does not reset this because, from a K8s perspective, all we are doing is scaling down the statefulset. That is partly why "decommission" makes a bit more sense to me here: our usual flow is scale down == full decommission, and here we are just modifying that to not delete the data on disk first. This is very similar to the cleanupData option, and I wonder if we could just re-use that. |
|
The operator also has no idea when doing a scale down if any of the other existing pending target pods will be able to reschedule onto the node we are making available by scaling down. So without a whole bunch of logic trying to figure that out, in most scenarios we really DO want to do a decommission. |
|
Because the operator does not know when scaling down if any other pod will suddenly be able to schedule onto the node, I would say

The fact that another pod will be able to schedule and assume the prior pod's AIS node identity in the smap is not something we know ahead of time. |
|
I still don't understand how Kubernetes can force us into a destructive decommission when we definitely want something else. The Operator is in charge. There's the API:

The Operator could put a target in maintenance mode once before scale-down, and later clear maintenance after the pod/daemon comes back and re-registers with updated intra-cluster / intra-data URLs. There's maybe a razor-thin corner case that'll require work on our side, in particular AIS node ID versus K8s pod name/ID, etc. But one thing is painfully clear: we should not be scaling down via decommission. |
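The maintenance-based flow proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions: `nodeAdmin`, `fakeAdmin`, `scaleDown`, and `onRejoin` are hypothetical names, not the AIS or operator API, and the in-memory fake stands in for real admin calls.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeAdmin is a hypothetical abstraction over the AIS admin calls the
// operator would make; it is not the actual AIS API.
type nodeAdmin interface {
	StartMaintenance(daemonID string) error
	StopMaintenance(daemonID string) error
}

// fakeAdmin is an in-memory stand-in that tracks maintenance state.
type fakeAdmin struct{ inMaint map[string]bool }

func (f *fakeAdmin) StartMaintenance(id string) error {
	f.inMaint[id] = true
	return nil
}

func (f *fakeAdmin) StopMaintenance(id string) error {
	if !f.inMaint[id] {
		return errors.New("node " + id + " is not in maintenance")
	}
	delete(f.inMaint, id)
	return nil
}

// scaleDown puts the target in maintenance first, then removes its pod.
// Data on disk is left untouched.
func scaleDown(a nodeAdmin, daemonID string, deletePod func() error) error {
	if err := a.StartMaintenance(daemonID); err != nil {
		return fmt.Errorf("start maintenance for %s: %w", daemonID, err)
	}
	return deletePod()
}

// onRejoin clears maintenance once the daemon has re-registered.
func onRejoin(a nodeAdmin, daemonID string) error {
	return a.StopMaintenance(daemonID)
}

func main() {
	admin := &fakeAdmin{inMaint: map[string]bool{}}
	_ = scaleDown(admin, "t-bbb", func() error { return nil })
	fmt.Println("in maintenance after scale-down:", admin.inMaint["t-bbb"])
	_ = onRejoin(admin, "t-bbb")
	fmt.Println("in maintenance after rejoin:", admin.inMaint["t-bbb"])
}
```

The open question from the thread is who triggers `onRejoin`: the sketch assumes the operator can observe the re-registration, which it does not do today.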
It's a two-step thing. If Eric sets spec.size in the CR such that we want a smaller statefulset, then that's what we'll do, which follows our usual scale-down process. The user is directly telling us they want to decommission a node. In this scenario, the pod that was pending is able to assume the node made available by scale down, but only AFTER the statefulset is already modified. In a lot of deployments that's not even possible. Trying to scale down while you have pending pods is already an edge case, and one to avoid if possible.
I am not sure this is possible, for a couple of reasons.
In either scenario I feel like, to work with a simple flag, it needs to be a temporary option that is toggled off in spec once done, or we risk data loss and stale smap entries from not properly removing scaled-down nodes.
If it's not intended as a manual admin action, we might need something more advanced... |
We run into issues currently in the following case:
spec.size in the CRD
This decommissioning causes some issues. If everything works well, it needlessly deletes all the data on node B: target pod 0 is going to be immediately scheduled on node B and could have continued serving the files cached there. It's also possible for the decommission to fail partway through, which deletes some bucket directories and leaves others on the node. After target pod 0 is scheduled on the node, a request for a file from one of the deleted buckets will fail with a 500 response code.
To address both cases, this adds a flag to skip decommissioning when scaling down either the proxy or target statefulsets.
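As a rough sketch of what such a flag gates (not the code in this PR): the function name `scaleDownTargets`, the `skipDecommission` parameter, and the `ais-target-N` pod naming are all assumptions for illustration. The only Kubernetes behavior relied on is that a StatefulSet scale-down removes the highest ordinals first.

```go
package main

import "fmt"

// scaleDownTargets walks the pods a StatefulSet would remove when going
// from oldSize to newSize (highest ordinal first) and calls decommission
// for each one unless skipDecommission is set.
// Pod names and the flag name are hypothetical.
func scaleDownTargets(oldSize, newSize int, skipDecommission bool,
	decommission func(pod string) error) ([]string, error) {
	var removed []string
	for ord := oldSize - 1; ord >= newSize; ord-- {
		pod := fmt.Sprintf("ais-target-%d", ord)
		if !skipDecommission {
			if err := decommission(pod); err != nil {
				// Stop early: a partial decommission should not be
				// followed by removing the pod.
				return removed, fmt.Errorf("decommission %s: %w", pod, err)
			}
		}
		removed = append(removed, pod)
	}
	return removed, nil
}

func main() {
	decom := func(pod string) error {
		fmt.Println("decommission", pod)
		return nil
	}
	// With the flag set, no decommission call is made.
	removed, _ := scaleDownTargets(2, 1, true, decom)
	fmt.Println(removed) // [ais-target-1]
}
```

With the flag set, the removed node's cluster map entry is left for AIS to handle, which is the stale-entry concern raised elsewhere in this conversation.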