The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Talos v1.6.1
- Kernel Version: 6.1.69
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): 1.29.0 - Talos
- GPU Operator Version: 23.9.1
2. Issue or feature description
The operator tries to pull a driver image tag that does not exist: the tag is assembled from the driver version plus the kernel and OS (`535.129.03-6.1.69-talos-talosv1.6.1`), and no such tag is published under nvcr.io/nvidia/driver.
```
❯ k describe po nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  56s               default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd to rhode
  Normal   Pulled     18s               kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5" already present on machine
  Normal   Created    18s               kubelet            Created container k8s-driver-manager
  Normal   Started    18s               kubelet            Started container k8s-driver-manager
  Normal   BackOff    15s               kubelet            Back-off pulling image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1"
  Warning  Failed     15s               kubelet            Error: ImagePullBackOff
  Normal   Pulling    4s (x2 over 17s)  kubelet            Pulling image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1"
  Warning  Failed     2s (x2 over 16s)  kubelet            Failed to pull image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": failed to resolve reference "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1: not found
  Warning  Failed     2s (x2 over 16s)  kubelet            Error: ErrImagePull
```
```
❯ k get po
NAME                                                     READY   STATUS             RESTARTS      AGE
gpu-feature-discovery-pgc7c                              0/1     Init:0/1           0             2m47s
nvidia-container-toolkit-daemonset-lw22k                 0/1     Init:0/1           0             2m47s
nvidia-dcgm-exporter-qg6j7                               0/1     Init:0/1           0             2m47s
nvidia-device-plugin-daemonset-m8z55                     0/1     Init:0/1           0             2m47s
nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd   0/1     ImagePullBackOff   0             3m25s
nvidia-gpu-operator-79c7dc6d5-8dhhx                      1/1     Running            7 (13m ago)   2d19h
nvidia-operator-validator-xnbhr                          0/1     Init:0/4           0             2m47s
```
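To make the failure mode concrete: judging only from the tag in the events above, the image reference looks like it is assembled as `<driver>-<kernel>-<os>`. A minimal sketch of that assembly (the function name and argument split are my assumptions, not actual operator code):

```python
def driver_image_tag(driver_version: str, kernel: str, os_tag: str) -> str:
    """Hypothetical reconstruction of how the failing tag is built.

    The pull errors show nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1,
    i.e. <driver>-<kernel>-<os>. The nvcr.io/nvidia/driver repository publishes
    <driver>-<os> style tags (e.g. 535.129.03-ubuntu22.04), so appending the
    kernel string yields a tag that cannot be resolved.
    """
    return f"{driver_version}-{kernel}-{os_tag}"

# Reproduces the exact tag from the events above.
tag = driver_image_tag("535.129.03", "6.1.69-talos", "talosv1.6.1")
print(tag)  # -> 535.129.03-6.1.69-talos-talosv1.6.1
```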
3. Steps to reproduce the issue
Deploy the GPU operator with the default configuration on a Talos Kubernetes cluster.
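For reference, "default configuration" here means the standard Helm install from the NVIDIA chart repository, roughly as below (no custom values were set; adjust the namespace if yours differs):

```shell
# Add the NVIDIA Helm repository and install the GPU operator
# with default chart values -- the setup that reproduces the issue.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```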
4. Information to attach (optional if deemed irrelevant)
- kubectl get pods -n OPERATOR_NAMESPACE
- kubectl get ds -n OPERATOR_NAMESPACE
- kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

```
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
```

NOTE: please refer to the must-gather script for the debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com