fix: add talos support #695
|
I can independently confirm that this works on my Talos cluster. After installing with For repeatability, you will need a container image to be built. I have pushed one to asymingt/k8s-dra-driver-gpu. You will need to modify this line to To optionally rebuild the container image, install docker + qemu-binfmt + buildx, checkout this code and run: |
|
I believe this is set to be fixed on the Talos side, by them installing the nvidia stuff where this expects it to go, not the other way around |
|
While we wait for Talos to update its driver install location, I've been trying to get MPS working on Talos using this PR branch and the following helm values.
gpuResourcesEnabledOverride: true
resources:
  gpus:
    enabled: true
  computeDomains:
    enabled: false
featureGates:
  MPSSupport: true
Looks like the It's probably related to this issue: #469 I've opened a PR to fix it on your branch: hydazz#1 |
|
@hydazz given your comment about Talos adjusting themselves to accommodate the existing search paths, how would you propose moving forward with this PR? |
@klueska I don't have definitive knowledge, I just inferred that conclusion based on:
(I could not find the referenced discussion) siderolabs/extensions#836 I don't know if there are talks between NVIDIA and Talos, or what exists beyond what is linked above, but it could easily be fixed here, just with something better than Perhaps @frezbo would have more insight? |
func getTalosLibrarySearchPaths() []string {
	return []string{
		"/driver-root/usr/local/glibc/usr/lib",
		"/driver-root/usr/local/glibc/lib",
		"/driver-root/usr/local/glibc/lib64",
	}
}
@elezar is this something we would want to add directly to nvcdi in the nvidia-container-toolkit as a standard search path?
I don't think there's a problem in adding this to the toolkit. At the moment the defaults are defined at quite a low level (which is where @hydazz has added them in NVIDIA/nvidia-container-toolkit#1621) and we may want to consider making these easier to specify at a higher level.
It would be nice if these paths were supported on the NVIDIA side. We (SideroLabs) are open to using better paths, but we have a constraint that it cannot be standard |
|
@klueska what do you think: can we still do something here for the next release? I feel like we should. But now it's rather tight again. |
Signed-off-by: hydazz <alexanderhyde@icloud.com>
|
Opened a PR in the toolkit to remove the overwrite here. Please review and lmk if any changes are needed, keen to jump onto the DRA train 🙂 |
|
@hydazz @jgehrcke if we're expecting the toolkit to be updated to include this change, I don't think that's something that can be done for the upcoming release. Would a middle ground be adding a configurable option (envvar / config file option) that allows a user to explicitly specify the paths in the container to be searched for libraries? This can be passed to the CDI library on construction and used in the prestart scripts. Note that although NVIDIA/nvidia-container-toolkit#1621 gets us some of the way there, we may also need to update the detection logic to be able to locate |
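To make the "configurable option" idea concrete, here is a minimal Go sketch of reading a colon-separated list of library search paths from an environment variable and falling back to defaults when it is unset. The variable name `DRA_LIBRARY_SEARCH_PATHS`, the helper names, and the placeholder default path are all hypothetical; none of them are existing driver or toolkit options, and how the result would be handed to the CDI library is left out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// librarySearchPathsEnvVar is a hypothetical variable name used only for
// illustration; it is not an existing option of the DRA driver or toolkit.
const librarySearchPathsEnvVar = "DRA_LIBRARY_SEARCH_PATHS"

// parseSearchPaths splits a colon-separated list, skips empty and relative
// entries, cleans each path, and drops duplicates while preserving order.
func parseSearchPaths(raw string) []string {
	seen := make(map[string]bool)
	var paths []string
	for _, entry := range strings.Split(raw, ":") {
		entry = strings.TrimSpace(entry)
		if entry == "" || !filepath.IsAbs(entry) {
			continue
		}
		cleaned := filepath.Clean(entry)
		if seen[cleaned] {
			continue
		}
		seen[cleaned] = true
		paths = append(paths, cleaned)
	}
	return paths
}

// resolveSearchPaths returns the user-specified paths if any are valid,
// otherwise the supplied defaults.
func resolveSearchPaths(raw string, defaults []string) []string {
	if user := parseSearchPaths(raw); len(user) > 0 {
		return user
	}
	return defaults
}

func main() {
	// Placeholder default, not the toolkit's real default list.
	defaults := []string{"/driver-root/usr/lib"}
	paths := resolveSearchPaths(os.Getenv(librarySearchPathsEnvVar), defaults)
	fmt.Println(strings.Join(paths, "\n"))
}
```

On Talos, a user could then set the variable to something like `/driver-root/usr/local/glibc/usr/lib` without the driver needing any Talos-specific code path.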
|
Thank you, @asymingt, for sharing enough to get started with a PoC! |
|
Talos 1.13 now ships |
Builds a patched version of nvcr.io/nvidia/k8s-dra-driver-gpu that adds /usr/local/glibc/usr/lib and /usr/local/bin to the library/binary search paths, matching upstream kubernetes-sigs/dra-driver-nvidia-gpu#695. Renovate will track NVIDIA/k8s-dra-driver-gpu releases to keep the VERSION in sync. Remove this app once PR #695 is released upstream. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Talos 1.13.0 observations with raw nvidia-dra/defaults: Some path overwrites still seem to be needed, either in the scripts here, the nvcdi overwrites here, or the nvcdi overwrites upstream? Config diff: |
|
Unknown CLA label state. Rechecking for CLA labels. Send feedback to sig-contributor-experience at kubernetes/community. /check-cla |
|
I have a custom installer with NVIDIA extensions; could you test it? I got DRA working:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --wait --install -n gpu-operator gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set hostPaths.driverInstallDir=/usr/local/lib \
  --set cdi.nriPluginEnabled=true \
  --version v25.10.1
Then disable the device plugin:
kubectl patch clusterpolicy cluster-policy --type=merge \
  -p '{"spec":{"devicePlugin":{"enabled":false}}}'
|
Use this to validate:
---
apiVersion: v1
kind: Namespace
metadata:
  name: dra-gpu-share-test
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: dra-gpu-share-test
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: dra-gpu-share-test
  name: pod
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
|
|
@hydazz can you please sign the CLA? Otherwise we are not able to accept the changes. |
|
Would they still be useful for Talos 1.11 and 1.12? |
|
PR needs rebase.
It just makes things complex. We were just using non-standard paths, but now we have contained most of them and no longer have conflicting stuff |
|
@hydazz needs CLA and rebase! |
|
Talos 1.13 now supports DRA as per the official docs, and this PR will break stuff on Talos. This may be closed now |
|
Can confirm the existing DRA driver runs with talos v1.13.0-rc.0 - the kubelet plugin config pod running |
This is a starter PR to add support for Talos OS's different nvidia paths.
Tested the gpu component with the changes here in my environment and it works.
Feedback is needed, as I'm unsure how to add the usr/local/glibc path to CDI nicely. I don't believe getTalosLibrarySearchPaths will cut it globally...