The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Talos v1.6.1
- Kernel Version: 6.1.69
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): 1.29.0 - Talos
- GPU Operator Version: 23.9.1
2. Issue or feature description
The operator tries to pull a driver image tag that does not exist: the tag is assembled from the driver version plus the kernel and OS (`535.129.03-6.1.69-talos-talosv1.6.1`), and no such tag is published under nvcr.io/nvidia/driver.
```
❯ k describe po nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  56s               default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd to rhode
  Normal   Pulled     18s               kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5" already present on machine
  Normal   Created    18s               kubelet            Created container k8s-driver-manager
  Normal   Started    18s               kubelet            Started container k8s-driver-manager
  Normal   BackOff    15s               kubelet            Back-off pulling image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1"
  Warning  Failed     15s               kubelet            Error: ImagePullBackOff
  Normal   Pulling    4s (x2 over 17s)  kubelet            Pulling image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1"
  Warning  Failed     2s (x2 over 16s)  kubelet            Failed to pull image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": failed to resolve reference "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1: not found
  Warning  Failed     2s (x2 over 16s)  kubelet            Error: ErrImagePull
```
```
❯ k get po
NAME                                                     READY   STATUS             RESTARTS      AGE
gpu-feature-discovery-pgc7c                              0/1     Init:0/1           0             2m47s
nvidia-container-toolkit-daemonset-lw22k                 0/1     Init:0/1           0             2m47s
nvidia-dcgm-exporter-qg6j7                               0/1     Init:0/1           0             2m47s
nvidia-device-plugin-daemonset-m8z55                     0/1     Init:0/1           0             2m47s
nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd   0/1     ImagePullBackOff   0             3m25s
nvidia-gpu-operator-79c7dc6d5-8dhhx                      1/1     Running            7 (13m ago)   2d19h
nvidia-operator-validator-xnbhr                          0/1     Init:0/4           0             2m47s
```
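To make the failure mode concrete: judging only from the tag in the events above, the image reference looks like it is assembled as `<driver>-<kernel>-<os>`. A minimal sketch of that assembly (the function name and argument split are my assumptions, not actual operator code):

```python
def driver_image_tag(driver_version: str, kernel: str, os_tag: str) -> str:
    """Hypothetical reconstruction of how the failing tag is built.

    The pull errors show nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1,
    i.e. <driver>-<kernel>-<os>. The nvcr.io/nvidia/driver repository publishes
    <driver>-<os> style tags (e.g. 535.129.03-ubuntu22.04), so appending the
    kernel string yields a tag that cannot be resolved.
    """
    return f"{driver_version}-{kernel}-{os_tag}"

# Reproduces the exact tag from the events above.
tag = driver_image_tag("535.129.03", "6.1.69-talos", "talosv1.6.1")
print(tag)  # -> 535.129.03-6.1.69-talos-talosv1.6.1
```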
3. Steps to reproduce the issue
Deploy the GPU operator with the default configuration on a Talos Kubernetes cluster.
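For reference, "default configuration" here means the standard Helm install from the NVIDIA chart repository, roughly as below (no custom values were set; adjust the namespace if yours differs):

```shell
# Add the NVIDIA Helm repository and install the GPU operator
# with default chart values -- the setup that reproduces the issue.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```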
4. Information to attach (optional if deemed irrelevant)
- kubectl get pods -n OPERATOR_NAMESPACE
- kubectl get ds -n OPERATOR_NAMESPACE
- kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

```
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
```

NOTE: please refer to the must-gather script for the debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com