
[Bug]: Mig-Manager restarts kubelet #2415

@YulianBortsov

Description


mig-manager hook restarts kubelet during cloud-init on AL2023 EKS — corrupts containerd cgroup-v2 state, node permanently NotReady

Summary

On Amazon EKS with the AL2023 NVIDIA-accelerated AMI (cgroup v2 unified hierarchy) and Karpenter-provisioned GPU nodes, the GPU Operator's mig-manager DaemonSet applies its layout very early in the node's lifecycle — while kubelet is still bootstrapping. The hook configuration in the operator's default default-mig-parted-config ConfigMap restarts kubelet pre/post-MIG-apply. On cgroup v2 this corrupts containerd's runtime state: the runtime cannot reattach to the cgroups it owned before the restart, GPU /dev/nvidia* device-cgroup permissions are left stale, and the node enters an unrecoverable NotReady loop. Only EC2 instance replacement clears the state.

The exact same chart configuration on AL2 (cgroup v1, separate hierarchies per controller) does not exhibit this — containerd recovers cleanly across kubelet restarts because controller state is independent.

We have a stable workaround in production: disable migManager and apply the MIG layout via cloud-init userData using nvidia-mig-parted directly, before kubelet ever starts, with empty hooks. This issue is filed to (a) document the AL2023 cgroup-v2 caveat and (b) propose upstream fixes so the operator can handle this case natively.


Environment

Component | Version / setting
--- | ---
Amazon EKS | 1.34
Node AMI | AL2023 NVIDIA-accelerated EKS AMI (al2023@v20260409)
Cgroup hierarchy | unified cgroup v2 (AL2023 default; AL2 used cgroup v1 and was unaffected)
Node provisioner | Karpenter (amiFamily: AL2023)
Instance type | g7e.4xlarge (NVIDIA RTX PRO 6000 Blackwell Server Edition)
GPU vBIOS | ≥ 98.02.55.00.00
NVIDIA driver | R580+ (host-installed by the AMI; not operator-managed)
GPU Operator chart | gpu-operator v25.10.0
migManager.enabled | true
default-mig-parted-config | shipped default, profile all-1g.24gb
DRA driver | nvidia-dra-driver-gpu v25.8.0 (enabled)
MIG label source | nvidia.com/mig.config = all-1g.24gb, applied via Karpenter NodePool template metadata
Goal | 4 × 1g.24gb MIG partitions ready for DRA scheduling at node Ready

Symptoms

  1. The node joins the cluster and reports Ready briefly (typically <30 s).
  2. Within ~30 s of the first Ready, kubelet drops to NotReady.
  3. containerd loses track of pause containers — its bookkeeping references cgroup paths that no longer exist (or are stale from the pre-restart kubelet).
  4. /dev/nvidia* device files are in an inconsistent state: some present, some missing, mismatched permissions.
  5. kubelet logs show repeated container runtime errors after the mig-manager hook fires.
  6. The node never recovers; the only remediation is replacing the EC2 instance entirely (Karpenter delete + new node).
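
For anyone triaging a node in this state, the following checks surface the symptoms above (kubectl from anywhere with cluster access; the rest over SSM/SSH on the affected instance, substituting your node name):

# Node conditions and recent events
kubectl describe node <node-name>

# kubelet and containerd logs around the time the hook fired
journalctl -u kubelet -u containerd --since "-15 min" --no-pager

# Pod cgroups under the unified hierarchy (orphaned entries linger here)
ls /sys/fs/cgroup/kubepods.slice/

# GPU device nodes and their permissions
ls -l /dev/nvidia*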

Root cause analysis

  1. The mig-manager DaemonSet pod is scheduled and starts on the node while kubelet is still in the middle of bootstrap. (cloud-init has finished, the kubelet binary is up enough to register the node, but nodeadm / containerd / kubelet aren't fully steady-state yet.)
  2. mig-manager reads nvidia.com/mig.config = all-1g.24gb from the node label, sees a layout change is needed, and runs nvidia-mig-parted apply with the operator's bundled hooks (default-mig-parted-config ConfigMap, hooks.yaml section).
  3. hooks.yaml hardcodes a kubelet restart (and optionally a containerd restart) pre- and post-MIG-apply.
  4. On cgroup v2 unified hierarchy, restarting kubelet mid-bootstrap while pause containers and other early-boot containers still hold cgroup state corrupts containerd's runtime tracking. Containerd cannot reattach to the existing cgroups after the restart, but those cgroups still hold device-allocation state from the previous kubelet's perspective.
  5. Net result:
    • GPU device-cgroup permissions are stale.
    • containerd believes it owns containers that aren't actually managed anymore.
    • kubelet retries fail in a tight loop because the runtime is in an inconsistent state.
    • The node sits in NotReady indefinitely.

Why this didn't bite on AL2

AL2 uses cgroup v1 with separate hierarchies per controller (cpu, memory, devices, etc.). Stale state in one controller does not cascade to others. containerd recovers cleanly across a kubelet restart because the per-controller boundaries isolate the failure modes.

AL2023 (and any modern distro defaulting to cgroup v2 unified) does not have those isolation boundaries. The single unified hierarchy makes the kubelet-restart-while-bootstrapping pattern unsafe.
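
A quick way to check which cgroup hierarchy a given node is actually running (handy when mixing AL2 and AL2023 node groups):

stat -fc %T /sys/fs/cgroup/
# prints "cgroup2fs" on the unified cgroup v2 hierarchy (AL2023 default)
# prints "tmpfs" on cgroup v1 (AL2 default)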


Reproduction steps

  1. Deploy a Karpenter-managed EKS cluster on EKS 1.34 with amiFamily: AL2023 (or any AL2023-based NVIDIA AMI).
  2. Install GPU Operator v25.10.0 with migManager.enabled: true and the default default-mig-parted-config ConfigMap (i.e. the shipped hooks.yaml with the kubelet restart hook intact).
  3. Configure a Karpenter NodePool that:
    • Provisions an instance with MIG-capable GPUs (e.g. g7e.4xlarge / RTX PRO 6000 Blackwell, or any A100/H100/H200 SKU).
    • Sets the node label nvidia.com/mig.config = all-1g.24gb (or any non-trivial layout that requires a layout change vs. the firmware default); see the NodePool sketch after these steps.
  4. Provision a node and observe its lifecycle. Within ~30 s of first Ready, the node will drop to NotReady and stay there.
  5. kubectl describe node <name> and the host's /var/log/cloud-init-output.log + journalctl -u kubelet -u containerd will show the cgroup / runtime corruption symptoms above.

The same configuration on AL2 (cgroup v1) reproduces neither the corruption nor the NotReady transition.
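
For step 3, a minimal sketch of where the label goes in a Karpenter NodePool (field names from the karpenter.sh/v1beta1 NodePool API; the resource names and requirements are illustrative, so adjust them and the apiVersion to your Karpenter release):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-mig                                # illustrative name
spec:
  template:
    metadata:
      labels:
        nvidia.com/mig.config: all-1g.24gb     # read by mig-manager when the node registers
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g7e.4xlarge"]
      nodeClassRef:
        name: gpu-mig                          # EC2NodeClass with amiFamily: AL2023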


Workaround we deployed

Disable mig-manager and apply the MIG layout in cloud-init userData, before kubelet ever starts, using nvidia-mig-parted directly — with empty hooks so neither kubelet nor containerd is touched.

Helm values change:

migManager:
  enabled: false
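
One way to roll that values change out (assuming the chart was installed from NVIDIA's Helm repo under the release name gpu-operator in the gpu-operator namespace; adjust names to your setup):

helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set migManager.enabled=false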

Karpenter EC2NodeClass userData (the script is wrapped as MIME multi-part by Karpenter when amiFamily: AL2023 is set; the supplied script itself is plain bash):

#!/bin/bash
set -euxo pipefail

# Wait for NVIDIA driver module to load (up to 2 min)
for i in $(seq 1 60); do
  if nvidia-smi -L >/dev/null 2>&1; then break; fi
  sleep 2
done

# Install mig-parted RPM directly (same binary mig-manager uses internally)
dnf install -y https://github.com/NVIDIA/mig-parted/releases/download/v0.14.0/nvidia-mig-manager-0.14.0-1.x86_64.rpm

mkdir -p /etc/nvidia-mig-manager
cat > /etc/nvidia-mig-manager/config.yaml <<'YAML'
version: v1
mig-configs:
  all-1g.24gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.24gb": 4
YAML

cat > /etc/nvidia-mig-manager/hooks.yaml <<'YAML'
version: v1
hooks: {}    # NB: empty hooks — no kubelet restart, no containerd touch
YAML

/usr/bin/nvidia-mig-parted apply \
  -f /etc/nvidia-mig-manager/config.yaml \
  -k /etc/nvidia-mig-manager/hooks.yaml \
  -c all-1g.24gb

nvidia-smi -L
nvidia-smi mig -lgi
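
For reference, a skeleton of how the script above plugs into the EC2NodeClass (field names from the karpenter.k8s.aws/v1beta1 API; other required fields such as the node role and subnet/security-group selector terms are omitted here):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-mig              # illustrative name
spec:
  amiFamily: AL2023          # Karpenter wraps userData as MIME multi-part for AL2023
  userData: |
    #!/bin/bash
    set -euxo pipefail
    # ... MIG bootstrap script shown above ...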

Why this works

  • MIG partitioning is stored in GPU firmware and persists across reboots, so a one-shot apply at first boot is sufficient for the EC2 instance lifetime.
  • hooks: {} skips the kubelet/containerd restarts entirely. They aren't needed at this point — no GPU client containers exist yet (kubelet hasn't even joined the cluster).
  • The DRA driver's ResourceSlice publishing happens after kubelet is up and stable; it sees the already-applied MIG layout and reports it correctly without needing mig-manager.
  • Karpenter node replacement is safe: each new EC2 instance runs cloud-init from scratch and applies the layout at first boot.
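
Related to the first bullet above (firmware persistence): on subsequent boots the layout can simply be asserted rather than re-applied. A minimal sketch, assuming nvidia-mig-parted's assert subcommand (which exits non-zero when the live layout diverges from the config) and the same files as in the userData script:

if ! nvidia-mig-parted assert -f /etc/nvidia-mig-manager/config.yaml -c all-1g.24gb; then
  nvidia-mig-parted apply \
    -f /etc/nvidia-mig-manager/config.yaml \
    -k /etc/nvidia-mig-manager/hooks.yaml \
    -c all-1g.24gb
fi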

Verification

  • 8-pod CUDA workload (PoC) ran successfully across all 4 MIG slices (4 distinct UUIDs; both dedicated and shared modes verified).
  • DRA scheduler binds claims to the existing partitions on first try; no node bouncing.
  • Stable across multiple Karpenter node replacements over a multi-week test environment.
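
Two quick spot checks, independent of the workload tests above (output formats may vary slightly by driver and Kubernetes version):

# Expect 4 MIG device lines, each with its own UUID
nvidia-smi -L | grep -c "MIG 1g.24gb"

# Confirm the DRA driver published the MIG devices as ResourceSlices
kubectl get resourceslices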

Suggested upstream fixes

In order of preference:

1. Add migManager.skipKubeletRestartOnInitialApply chart value

A new boolean values flag that suppresses the kubelet restart hook for the first apply on a node (when no GPU client containers exist yet). On layout changes against an already-running cluster, the hook still fires as it does today.

This preserves mig-manager's value for in-place layout changes on running clusters while solving the initial-apply case cleanly. mig-manager could detect "no GPU clients running yet" by checking for any pods with nvidia.com/gpu resource requests on the node, or by checking whether kubelet has reached a steady Ready state.
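
From the values side, that could look like the following (the flag below does not exist in v25.10.0; it is the proposal):

migManager:
  enabled: true
  skipKubeletRestartOnInitialApply: true   # proposed flag, not a current chart value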

2. Cloud-init context detection in the hook

If cloud-init status (or the equivalent on non-cloud-init systems) reports that the node is still in early boot, skip the kubelet/containerd restart entirely. nvidia-mig-parted apply still runs, but the hooks become no-ops. This is the "smartest" fix because it requires no values flag and generalizes across distros without per-distro configuration.
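
A rough sketch of the guard such a hook could use (cloud-init status is a real command; the parsing and the unconditional restart at the end are illustrative):

# Skip the restart while the host is still in early boot
if command -v cloud-init >/dev/null 2>&1 && \
   cloud-init status 2>/dev/null | grep -q "status: running"; then
  echo "cloud-init still running; skipping kubelet restart hook"
  exit 0
fi
systemctl restart kubelet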

3. Documentation update

Update the AL2023 / EKS doc page (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html) to call out the cgroup v2 caveat explicitly, and recommend cloud-init-based MIG application for the initial layout on AL2023 (and any cgroup-v2 distro). The doc could embed a script very similar to the workaround above as the "recommended initial-layout" pattern, while keeping mig-manager as the supported in-cluster path for layout changes after the node is steady.

Even just (3) would be a meaningful improvement — most operators hitting this today don't realize the AL2 → AL2023 cgroup-version change is the root cause. A doc note would save them the multi-day debug.


Related issues

  • NVIDIA/gpu-operator#1323 — similar containerd corruption on k0s. Different distro, same class of failure: cgroup v2 plus runtime-state corruption when a restart hits mid-bootstrap. Strong evidence that the kubelet-restart hook is generally unsafe on cgroup v2.
  • awslabs/amazon-eks-ami#2323 — AL2023 NVIDIA AMI EGL/driver presence issue. Different problem (in the same operational space) but useful cross-reference for anyone debugging AL2023 + GPU.

Thanks for the operator's overall good ergonomics — it's been a pleasure to deploy in every other respect, and the rest of the stack (driver daemonset, device plugin, DCGM exporter, validator) has been rock-solid for us. Happy to test any proposed fix in our test environment (EKS 1.34 + AL2023 + Karpenter + RTX PRO 6000 Blackwell) and report back. Let me know if more diagnostic data (full cloud-init logs, kubelet+containerd journals from a corrupted node, mig-manager pod logs from the moment of the hook fire) would be helpful.

Labels: bug, needs-triage
