
[Bug]: Mig-Manager restarts kubelet #2415

@YulianBortsov

Description


mig-manager hook restarts kubelet during cloud-init on AL2023 EKS — corrupts containerd cgroup-v2 state, node permanently NotReady

Summary

On Amazon EKS with the AL2023 NVIDIA-accelerated AMI (cgroup v2 unified hierarchy) and Karpenter-provisioned GPU nodes, the GPU Operator's mig-manager DaemonSet applies its layout very early in the node's lifecycle — while kubelet is still bootstrapping. The hook configuration in the operator's default default-mig-parted-config ConfigMap restarts kubelet pre/post-MIG-apply. On cgroup v2 this corrupts containerd's runtime state: the runtime cannot reattach to the cgroups it owned before the restart, GPU /dev/nvidia* device-cgroup permissions are left stale, and the node enters an unrecoverable NotReady loop. Only EC2 instance replacement clears the state.

The exact same chart configuration on AL2 (cgroup v1, separate hierarchies per controller) does not exhibit this — containerd recovers cleanly across kubelet restarts because controller state is independent.

We have a stable workaround in production: disable migManager and apply the MIG layout via cloud-init userData using nvidia-mig-parted directly, before kubelet ever starts, with empty hooks. This issue is filed to (a) document the AL2023 cgroup-v2 caveat and (b) propose upstream fixes so the operator can handle this case natively.


Environment

Component | Version / setting
--- | ---
Amazon EKS | 1.34
Node AMI | AL2023 NVIDIA-accelerated EKS AMI (al2023@v20260409)
Cgroup hierarchy | unified cgroup v2 (AL2023 default; AL2 used cgroup v1 and was unaffected)
Node provisioner | Karpenter (amiFamily: AL2023)
Instance type | g7e.4xlarge (NVIDIA RTX PRO 6000 Blackwell Server Edition)
GPU vBIOS | ≥ 98.02.55.00.00
NVIDIA driver | R580+ (host-installed by the AMI; not operator-managed)
GPU Operator chart | gpu-operator v25.10.0
migManager.enabled | true
default-mig-parted-config | shipped default, profile all-1g.24gb
DRA driver | nvidia-dra-driver-gpu v25.8.0 (enabled)
MIG label source | nvidia.com/mig.config = all-1g.24gb, applied via Karpenter NodePool template metadata
Goal | 4 × 1g.24gb MIG partitions ready for DRA scheduling at node Ready

Symptoms

  1. The node joins the cluster and reports Ready briefly (typically <30 s).
  2. Within ~30 s of the first Ready, kubelet drops to NotReady.
  3. containerd loses track of pause containers — its bookkeeping references cgroup paths that no longer exist (or are stale from the pre-restart kubelet).
  4. /dev/nvidia* device files are in an inconsistent state: some present, some missing, mismatched permissions.
  5. kubelet logs show repeated container runtime errors after the mig-manager hook fires.
  6. The node never recovers; the only remediation is replacing the EC2 instance entirely (Karpenter delete + new node).
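
For anyone triaging a node in this state, the following checks surface the symptoms above (kubectl from anywhere with cluster access; the rest over SSM/SSH on the affected instance, substituting your node name):

# Node conditions and recent events
kubectl describe node <node-name>

# kubelet and containerd logs around the time the hook fired
journalctl -u kubelet -u containerd --since "-15 min" --no-pager

# Pod cgroups under the unified hierarchy (orphaned entries linger here)
ls /sys/fs/cgroup/kubepods.slice/

# GPU device nodes and their permissions
ls -l /dev/nvidia*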

Root cause analysis

  1. The mig-manager DaemonSet pod is scheduled and starts on the node while kubelet is still in the middle of bootstrap. (cloud-init has finished, the kubelet binary is up enough to register the node, but nodeadm / containerd / kubelet aren't fully steady-state yet.)
  2. mig-manager reads nvidia.com/mig.config = all-1g.24gb from the node label, sees a layout change is needed, and runs nvidia-mig-parted apply with the operator's bundled hooks (default-mig-parted-config ConfigMap, hooks.yaml section).
  3. hooks.yaml hardcodes a kubelet restart (and optionally a containerd restart) pre- and post-MIG-apply.
  4. On cgroup v2 unified hierarchy, restarting kubelet mid-bootstrap while pause containers and other early-boot containers still hold cgroup state corrupts containerd's runtime tracking. Containerd cannot reattach to the existing cgroups after the restart, but those cgroups still hold device-allocation state from the previous kubelet's perspective.
  5. Net result:
    • GPU device-cgroup permissions are stale.
    • containerd believes it owns containers that aren't actually managed anymore.
    • kubelet retries fail in a tight loop because the runtime is in an inconsistent state.
    • The node sits in NotReady indefinitely.

Why this didn't bite on AL2

AL2 uses cgroup v1 with separate hierarchies per controller (cpu, memory, devices, etc.). Stale state in one controller does not cascade to others. containerd recovers cleanly across a kubelet restart because the per-controller boundaries isolate the failure modes.

AL2023 (and any modern distro defaulting to cgroup v2 unified) does not have those isolation boundaries. The single unified hierarchy makes the kubelet-restart-while-bootstrapping pattern unsafe.
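
A quick way to check which cgroup hierarchy a given node is actually running (handy when mixing AL2 and AL2023 node groups):

stat -fc %T /sys/fs/cgroup/
# prints "cgroup2fs" on the unified cgroup v2 hierarchy (AL2023 default)
# prints "tmpfs" on cgroup v1 (AL2 default)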


Reproduction steps

  1. Deploy a Karpenter-managed EKS cluster on EKS 1.34 with amiFamily: AL2023 (or any AL2023-based NVIDIA AMI).
  2. Install GPU Operator v25.10.0 with migManager.enabled: true and the default default-mig-parted-config ConfigMap (i.e. the shipped hooks.yaml with the kubelet restart hook intact).
  3. Configure a Karpenter NodePool that:
    • Provisions an instance with MIG-capable GPUs (e.g. g7e.4xlarge / RTX PRO 6000 Blackwell, or any A100/H100/H200 SKU).
    • Sets the node label nvidia.com/mig.config = all-1g.24gb (or any non-trivial layout that requires a layout change vs. the firmware default); see the NodePool sketch after these steps.
  4. Provision a node and observe its lifecycle. Within ~30 s of first Ready, the node will drop to NotReady and stay there.
  5. kubectl describe node <name> and the host's /var/log/cloud-init-output.log + journalctl -u kubelet -u containerd will show the cgroup / runtime corruption symptoms above.

The same configuration on AL2 (cgroup v1) reproduces neither the corruption nor the NotReady transition.
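
For step 3, a minimal sketch of where the label goes in a Karpenter NodePool (field names from the karpenter.sh/v1beta1 NodePool API; the resource names and requirements are illustrative, so adjust them and the apiVersion to your Karpenter release):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-mig                                # illustrative name
spec:
  template:
    metadata:
      labels:
        nvidia.com/mig.config: all-1g.24gb     # read by mig-manager when the node registers
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g7e.4xlarge"]
      nodeClassRef:
        name: gpu-mig                          # EC2NodeClass with amiFamily: AL2023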


Workaround we deployed

Disable mig-manager and apply the MIG layout in cloud-init userData, before kubelet ever starts, using nvidia-mig-parted directly — with empty hooks so neither kubelet nor containerd is touched.

Helm values change:

migManager:
  enabled: false
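
One way to roll that values change out (assuming the chart was installed from NVIDIA's Helm repo under the release name gpu-operator in the gpu-operator namespace; adjust names to your setup):

helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set migManager.enabled=false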

Karpenter EC2NodeClass userData (the script is wrapped as MIME multi-part by Karpenter when amiFamily: AL2023 is set; the supplied script itself is plain bash):

#!/bin/bash
set -euxo pipefail

# Wait for NVIDIA driver module to load (up to 2 min)
for i in $(seq 1 60); do
  if nvidia-smi -L >/dev/null 2>&1; then break; fi
  sleep 2
done

# Install mig-parted RPM directly (same binary mig-manager uses internally)
dnf install -y https://github.com/NVIDIA/mig-parted/releases/download/v0.14.0/nvidia-mig-manager-0.14.0-1.x86_64.rpm

mkdir -p /etc/nvidia-mig-manager
cat > /etc/nvidia-mig-manager/config.yaml <<'YAML'
version: v1
mig-configs:
  all-1g.24gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.24gb": 4
YAML

cat > /etc/nvidia-mig-manager/hooks.yaml <<'YAML'
version: v1
hooks: {}    # NB: empty hooks — no kubelet restart, no containerd touch
YAML

/usr/bin/nvidia-mig-parted apply \
  -f /etc/nvidia-mig-manager/config.yaml \
  -k /etc/nvidia-mig-manager/hooks.yaml \
  -c all-1g.24gb

nvidia-smi -L
nvidia-smi mig -lgi
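
For reference, a skeleton of how the script above plugs into the EC2NodeClass (field names from the karpenter.k8s.aws/v1beta1 API; other required fields such as the node role and subnet/security-group selector terms are omitted here):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-mig              # illustrative name
spec:
  amiFamily: AL2023          # Karpenter wraps userData as MIME multi-part for AL2023
  userData: |
    #!/bin/bash
    set -euxo pipefail
    # ... MIG bootstrap script shown above ...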

Why this works

  • MIG partitioning is stored in GPU firmware and persists across reboots, so a one-shot apply at first boot is sufficient for the EC2 instance lifetime.
  • hooks: {} skips the kubelet/containerd restarts entirely. They aren't needed at this point — no GPU client containers exist yet (kubelet hasn't even joined the cluster).
  • The DRA driver's ResourceSlice publishing happens after kubelet is up and stable; it sees the already-applied MIG layout and reports it correctly without needing mig-manager.
  • Karpenter node replacement is safe: each new EC2 instance runs cloud-init from scratch and applies the layout at first boot.
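
Related to the first bullet above (firmware persistence): on subsequent boots the layout can simply be asserted rather than re-applied. A minimal sketch, assuming nvidia-mig-parted's assert subcommand (which exits non-zero when the live layout diverges from the config) and the same files as in the userData script:

if ! nvidia-mig-parted assert -f /etc/nvidia-mig-manager/config.yaml -c all-1g.24gb; then
  nvidia-mig-parted apply \
    -f /etc/nvidia-mig-manager/config.yaml \
    -k /etc/nvidia-mig-manager/hooks.yaml \
    -c all-1g.24gb
fi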

Verification

  • 8-pod CUDA workload (PoC) ran successfully across all 4 MIG slices (4 distinct UUIDs; both dedicated and shared modes verified).
  • DRA scheduler binds claims to the existing partitions on first try; no node bouncing.
  • Stable across multiple Karpenter node replacements over a multi-week test environment.
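
Two quick spot checks, independent of the workload tests above (output formats may vary slightly by driver and Kubernetes version):

# Expect 4 MIG device lines, each with its own UUID
nvidia-smi -L | grep -c "MIG 1g.24gb"

# Confirm the DRA driver published the MIG devices as ResourceSlices
kubectl get resourceslices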

Suggested upstream fixes

In order of preference:

1. Add migManager.skipKubeletRestartOnInitialApply chart value

A new boolean values flag that suppresses the kubelet restart hook for the first apply on a node (when no GPU client containers exist yet). On layout changes against an already-running cluster, the hook still fires as it does today.

This preserves mig-manager's value for in-place layout changes on running clusters while solving the initial-apply case cleanly. mig-manager could detect "no GPU clients running yet" by checking for any pods with nvidia.com/gpu resource requests on the node, or by checking whether kubelet has reached a steady Ready state.
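
From the values side, that could look like the following (the flag below does not exist in v25.10.0; it is the proposal):

migManager:
  enabled: true
  skipKubeletRestartOnInitialApply: true   # proposed flag, not a current chart value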

2. Cloud-init context detection in the hook

If cloud-init status (or the equivalent on non-cloud-init systems) reports that the node is still in early boot, skip the kubelet/containerd restart entirely. nvidia-mig-parted apply still runs, but the hooks become no-ops. This is the "smartest" fix because it requires no values flag and generalizes across distros without per-distro configuration.
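
A rough sketch of the guard such a hook could use (cloud-init status is a real command; the parsing and the unconditional restart at the end are illustrative):

# Skip the restart while the host is still in early boot
if command -v cloud-init >/dev/null 2>&1 && \
   cloud-init status 2>/dev/null | grep -q "status: running"; then
  echo "cloud-init still running; skipping kubelet restart hook"
  exit 0
fi
systemctl restart kubelet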

3. Documentation update

Update the AL2023 / EKS doc page (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html) to call out the cgroup v2 caveat explicitly, and recommend cloud-init-based MIG application for the initial layout on AL2023 (and any cgroup-v2 distro). The doc could embed a script very similar to the workaround above as the "recommended initial-layout" pattern, while keeping mig-manager as the supported in-cluster path for layout changes after the node is steady.

Even just (3) would be a meaningful improvement — most operators hitting this today don't realize the AL2 → AL2023 cgroup-version change is the root cause. A doc note would save them the multi-day debug.


Related issues

  • NVIDIA/gpu-operator#1323 — similar containerd corruption on k0s. Different distro, same class of failure: cgroup v2 plus runtime-state corruption when a restart hits mid-bootstrap. Strong evidence that the kubelet-restart hook is generally unsafe on cgroup v2.
  • awslabs/amazon-eks-ami#2323 — AL2023 NVIDIA AMI EGL/driver presence issue. Different problem (in the same operational space) but useful cross-reference for anyone debugging AL2023 + GPU.

Thanks for the operator's overall good ergonomics — it's been a pleasure to deploy in every other respect, and the rest of the stack (driver daemonset, device plugin, DCGM exporter, validator) has been rock-solid for us. Happy to test any proposed fix in our test environment (EKS 1.34 + AL2023 + Karpenter + RTX PRO 6000 Blackwell) and report back. Let me know if more diagnostic data (full cloud-init logs, kubelet+containerd journals from a corrupted node, mig-manager pod logs from the moment of the hook fire) would be helpful.

Labels: bug, needs-triage
