[Staging] Add SLES support for AMD gpu-operator#371
Open
Priyankasaggu11929 wants to merge 109 commits intoROCm:stagingfrom
a22e2e4 to
1138d8e
Compare
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
* add suspend and resume functionality for remediation workflows * minor updates to docs * minor refactoring to avoid duplicate k8s get calls * add default configmap * fix helm chart issues * address code review comments * move remediation configs and scripts into separate files * add jq package to utils_container
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Yuva Shankar <11082310+yuva29@users.noreply.github.com>
… dashboard Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
…fter partitioning Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
…071) Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
(cherry picked from commit 42a7329)
* make max parallel workflows configurable for auto remediation * add zero value in default CR * address review comments (cherry picked from commit f023a5c)
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
(cherry picked from commit 3f1a1ee2ea08f7675a6aba6cd60ed2f06ca7bdc6)
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
(cherry picked from commit c8409b8)
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
* remediation e2e tests for suspend and resume actions * add e2e test for recoverypolicy cr * use init container image from dev.env (cherry picked from commit 3e0f7aa)
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
(cherry picked from commit 4c737d7)
* GPUOP-525 update auto node remediation documentation * address review comments (cherry picked from commit 8e3f3e0)
* customize auto node remediation options * address review comments * commit generated files * support custom labels and taints in workflow * handle custom drain policy * update documentation * fix e2e test (cherry picked from commit 8dd5196)
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
* upgrade Argo workflow CRDs and controller to v4.0.3 (#1235) * upgrade Argo workflow CRDs and controller to v4.0.3 * update controller image version to v4.0.3 (cherry picked from commit 155a669) * Update amd-gpu-operator.clusterserviceversion.yaml --------- Co-authored-by: Uday Bhaskar <udayb@amd.com> Co-authored-by: Praveen Kumar Shanmugam <58961022+spraveenio@users.noreply.github.com>
…OCm#500) * [Fix] GPUOP-607 fail the ANR workflow when imagePullBackOff * Update internal/controllers/remediation/scripts/test.sh * Update internal/controllers/remediation/scripts/test.sh --------- (cherry picked from commit 344e480) Signed-off-by: yansun1996 <Yan.Sun3@amd.com> Co-authored-by: Yan Sun <Yan.Sun3@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…m#501) * GPUOP-618 fix helm upgrade issue with latest Argo CRDs (#1283) (cherry picked from commit fe9ec91) * Apply suggestion from @biluriuday --------- Co-authored-by: Uday Bhaskar <udayb@amd.com>
* anr - fixes for applylabels step * multiple anr fixes (cherry picked from commit b33e4c9) Co-authored-by: Uday Bhaskar <udayb@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
* enable npd and anr e2e sims * increase validation check duration (cherry picked from commit 67defdf) Co-authored-by: Uday Bhaskar <udayb@amd.com>
Fix two bugs in DeviceConfig node assignment management:

1. buildNodeAssignments now logs and skips node assignment conflicts instead of returning a fatal error. A CR-level conflict should not block the entire operator; the runtime validateNodeAssignments check already handles this per-CR during reconciliation.
2. Remove premature updateNodeAssignments call during finalization that freed nodes from the in-memory map before the finalizer was removed. Node cleanup is now handled solely via the NotFound path after CR garbage collection, preventing other DeviceConfigs from claiming nodes mid-finalization.

Also adds DRA driver DaemonSet cleanup to the finalization path, which was previously only handled during normal reconciliation.

(cherry picked from commit a945553)
Co-authored-by: Nitish Bhat <bhatnitish@gmail.com>
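The "log and skip" behaviour described in point 1 can be pictured with a minimal sketch. The `deviceConfig` type and the body of `buildNodeAssignments` below are hypothetical stand-ins written for illustration, not the operator's actual code; the real function may well skip at a different granularity (for example the whole CR rather than a single node).

```go
package main

import (
	"fmt"
	"log"
)

// deviceConfig is a hypothetical, trimmed-down stand-in for a DeviceConfig CR:
// just a name and the nodes its selector matched.
type deviceConfig struct {
	Name  string
	Nodes []string
}

// buildNodeAssignments maps each node to the first DeviceConfig that claims it.
// A conflicting claim is logged and skipped instead of aborting the whole build,
// which is the behaviour the commit message describes.
func buildNodeAssignments(configs []deviceConfig) map[string]string {
	assignments := make(map[string]string)
	for _, dc := range configs {
		for _, node := range dc.Nodes {
			if owner, taken := assignments[node]; taken && owner != dc.Name {
				log.Printf("node %q already assigned to %q, skipping claim from %q", node, owner, dc.Name)
				continue
			}
			assignments[node] = dc.Name
		}
	}
	return assignments
}

func main() {
	assignments := buildNodeAssignments([]deviceConfig{
		{Name: "config-a", Nodes: []string{"node-1", "node-2"}},
		{Name: "config-b", Nodes: []string{"node-2", "node-3"}},
	})
	fmt.Println(assignments)
}
```

Here `node-2` stays with `config-a`, and `config-b` still receives `node-3`, so one conflicting CR does not prevent the rest of the map from being built.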
…d and it's E2Es (#1267) (ROCm#508) * DCM: mount default ConfigMap when spec.configManager.config is omitted When DeviceConfig.spec.configManager.config is nil or has an empty name, the DCM DaemonSet now always mounts a ConfigMap volume named default-dcm-config (configurable by setting spec.configManager.config.name). Add E2E coverage (TestDCMDefaultConfigMapWhenConfigOmitted), cluster_test helpers, SIM skips for GPU-only partition tests, and align E2E_DCM_IMAGE in dev.env with v1.4.1. * Helm default CM + operator EnsureDefaultDCMConfigMap + E2E/docs * changes * address comments * comments * dcm changes (cherry picked from commit e9c1e91) Co-authored-by: nikhilsk <47417007+nikhilsk@users.noreply.github.com>
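As a rough illustration of the fallback described above, the sketch below resolves the ConfigMap name to mount. The `configRef` type and `dcmConfigMapName` function are hypothetical names chosen for this example; only the `default-dcm-config` value and the nil-or-empty-name condition come from the commit message.

```go
package main

import "fmt"

// defaultDCMConfigMapName is the fallback name mentioned in the commit message.
const defaultDCMConfigMapName = "default-dcm-config"

// configRef is a hypothetical stand-in for spec.configManager.config.
type configRef struct {
	Name string
}

// dcmConfigMapName returns the ConfigMap name to mount: the user-provided name
// when one is set, otherwise the default.
func dcmConfigMapName(cfg *configRef) string {
	if cfg == nil || cfg.Name == "" {
		return defaultDCMConfigMapName
	}
	return cfg.Name
}

func main() {
	fmt.Println(dcmConfigMapName(nil))
	fmt.Println(dcmConfigMapName(&configRef{Name: "my-dcm-config"}))
}
```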
…4) (ROCm#526) (cherry picked from commit fa1328d092487fa7482c7d3166bbd5fd5fe6d74d) Co-authored-by: Srivatsa Sangli <58572624+sangli-pensando@users.noreply.github.com>
…nual test examples (#1364) (#1365) (ROCm#527) Add privileged SCC permissions to all ClusterRole definitions in manual/scheduled test documentation to support OpenShift deployments. (cherry picked from commit 9915f721319cd7bf8fcb2ac581092473c0c3dc56) Co-authored-by: Yan Sun <Yan.Sun3@amd.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* GPUOP-640 update remediation documentation * fix argo helm chart version for openshift (cherry picked from commit 1f246648b635f53b937c8003aa15278a67b9a008) (cherry picked from commit a968197d9974e22087eefe3cad2b7a184bf848c9) Co-authored-by: Uday Bhaskar <udayb@amd.com>
* metricsclient cli change * add test/e2e dependency to e2e sim * increase timeout (cherry picked from commit 5b1e46a) Co-authored-by: Praveen Kumar Shanmugam <58961022+spraveenio@users.noreply.github.com>
…rt (#1337) (ROCm#522)

* Add workflow and workflow-triggered pod collection to techsupport

Enhance the techsupport_dump.sh script to collect workflow CRs and workflow-triggered pods when auto node remediation feature is enabled. This helps with debugging workflow-based node remediation issues.

Changes:
- Add WORKFLOW_RESOURCES variable for workflow CRs
- Collect workflow CRs (get, describe, yaml/json output)
- Collect workflow-triggered pods identified by workflows.argoproj.io/workflow label
- Add per-node log collection for workflow-triggered pods
- Include error resilience with || true for ephemeral workflow pods

* Make pod_logs function resilient to ephemeral pod failures

Add error handling (|| true) to kubectl logs commands in pod_logs function to prevent script termination when collecting logs from ephemeral/terminated workflow pods. With set -e enabled, failed log collection would previously abort the entire techsupport run before reaching error handlers.

Changes:
- Add '2>&1 || true' to current container logs command
- Add '2>&1 || true' to previous container logs command
- Ensures individual pod log failures don't terminate script execution
- Critical for short-lived workflow pods that may be deleted during collection

* Add workflow controller pod collection to techsupport

Collect information and logs from the workflow controller pod (identified by label app=amd-gpu-operator-workflow-controller) in addition to workflow CRs and workflow-triggered pods.

Changes:
- Add workflow controller pod collection in cluster-wide section
- kubectl get/describe output in both text and JSON/YAML format
- Add workflow controller pod log collection per node
- Maintains error resilience with || true for optional feature

---------

(cherry picked from commit 70b0104)
Co-authored-by: Yan Sun <Yan.Sun3@amd.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…clude "useSourceImage" for example DeviceConfig (ROCm#523) * Include DeviceConfig driver.useSourceImage in OCP(olm) install docs Signed-off-by: Landon LaSmith <LLaSmith@redhat.com> * Airgapped: Sync driver version with OpenShift(OLM) documentation Signed-off-by: Landon LaSmith <LLaSmith@redhat.com> --------- Signed-off-by: Landon LaSmith <LLaSmith@redhat.com>
* Add DeviceConfig collection in testmonitor * Fix DRA Tests
Add ServiceAccount, ClusterRole, and ClusterRoleBinding for the DRA driver so it can run on OpenShift clusters.

The ClusterRole grants:
- privileged SCC (required for OpenShift)
- resourceslices CRUD (to publish GPU resources)
- resourceclaims get (to process allocation requests)
- nodes get (to look up node info for ResourceSlice ownership)

Also add the DRA driver service account to the OLM bundle's extra-service-accounts list so OLM-managed installs create the SA.

# Conflicts:
# bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
…d (#1388)

* Create DeviceClass from operator code on OpenShift when DRA is enabled

On OpenShift, operator-sdk cannot deploy DeviceClass resources via the OLM bundle. This adds handleDeviceClass to the reconciler which creates the gpu.amd.com DeviceClass using an unstructured client when running on OpenShift with DRA driver enabled. The DeviceClass is cluster-scoped and shared, so it is created once (AlreadyExists is handled gracefully) and never deleted on DeviceConfig finalization.

* Use deviceClassName constant instead of hardcoded string

Address review feedback: extract "gpu.amd.com" into a const and use it throughout handleDeviceClass.
…opriate AMD GPU driver versions

* add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6'
* add `SLESDefaultDriverVersionsMapper` to select driver versions
  - SLES 15 SP6/SP7 -> driver 7.0.2 (ref: https://repo.radeon.com/amdgpu-install/7.0.2/sle/)
  - SLES 15 SP5 -> driver 6.2.2 (ref: https://repo.radeon.com/amdgpu-install/6.2.2/sle/)
* register both 'sles' and 'suse' identifiers in mappers

Co-authored-by: alex-isv <alex.zacharow@suse.com>
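To make the mapping concrete, here is a minimal sketch of the two steps: turning an OS image string into a `sles-<major>.<sp>` name, then picking a default driver version. The function names `slesName` and `defaultDriverVersion`, the regular expression, and the "no service pack means minor version 0" rule are assumptions made for this example; they are not the signatures of `slesCMNameMapper` or `SLESDefaultDriverVersionsMapper` in the PR. The version table itself is taken from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// slesVersionRe captures the major version and the optional service pack from
// a string such as "SUSE Linux Enterprise Server 15 SP6".
var slesVersionRe = regexp.MustCompile(`(?i)suse linux enterprise server (\d+)(?: sp(\d+))?`)

// slesName is a hypothetical stand-in for the name mapper: it converts an OS
// image string into the "sles-<major>.<sp>" form.
func slesName(osImage string) (string, error) {
	m := slesVersionRe.FindStringSubmatch(osImage)
	if m == nil {
		return "", fmt.Errorf("not a recognizable SLES image string: %q", osImage)
	}
	sp := m[2]
	if sp == "" {
		sp = "0" // assumption: a release without a service pack maps to ".0"
	}
	return "sles-" + m[1] + "." + sp, nil
}

// defaultDriverVersion is a hypothetical stand-in for the default driver
// version mapper, using the SP-to-driver table from the commit message.
func defaultDriverVersion(name string) (string, bool) {
	switch strings.TrimPrefix(name, "sles-") {
	case "15.6", "15.7":
		return "7.0.2", true
	case "15.5":
		return "6.2.2", true
	}
	return "", false
}

func main() {
	name, err := slesName("SUSE Linux Enterprise Server 15 SP6")
	if err != nil {
		panic(err)
	}
	version, ok := defaultDriverVersion(name)
	fmt.Println(name, version, ok)
}
```

Returning a boolean from the version lookup leaves the caller free to decide what to do for service packs outside the table, which the commit message does not specify.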
4da60d3 to
666be77
Compare
… SUSE AMD GPU driver image
…sles"

* although, user-specified `BaseImageRegistry` still takes precedence
* also extend tests in `internal/kmmodule/kmmodule_test.go` to test above changes in `resolveDockerfile` func
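The precedence rule in this commit can be sketched as a small pure function. `resolveBaseImageRegistry` and the `docker.io` fallback for other operating systems are assumptions for illustration only; the PR implements the logic inside `resolveDockerfile`, whose real signature is not shown here.

```go
package main

import "fmt"

const (
	// suseRegistry is the SLES default named in the commit message.
	suseRegistry = "registry.suse.com"
	// fallbackRegistry is an assumed placeholder for the non-SLES default.
	fallbackRegistry = "docker.io"
)

// resolveBaseImageRegistry is a hypothetical helper: a user-specified registry
// always wins, then the SLES default applies, then the general fallback.
func resolveBaseImageRegistry(osName, userRegistry string) string {
	if userRegistry != "" {
		return userRegistry
	}
	if osName == "sles" {
		return suseRegistry
	}
	return fallbackRegistry
}

func main() {
	fmt.Println(resolveBaseImageRegistry("sles", ""))
	fmt.Println(resolveBaseImageRegistry("sles", "registry.example.com"))
	fmt.Println(resolveBaseImageRegistry("ubuntu", ""))
}
```

Checking the user-provided value first keeps existing DeviceConfigs that pin a private registry unaffected by the new SLES default.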
666be77 to
7dfec5e
Compare
(based on comment #365 (review) from the original PR)
Motivation
This PR aims to add support for SUSE Linux Enterprise Server (SLES) 15 SP5+ to the AMD GPU operator.
Technical Details
781c5b5 - add support for detecting SLES nodes and automatically selecting appropriate AMD GPU driver versions
* add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6'
* add `SLESDefaultDriverVersionsMapper` to select driver versions

0170a9a - add SLES Dockerfile template (`DockerfileTemplate.sles`) for building AMD GPU drivers on SLES (currently, I've skipped adding the GIM Dockerfile template for SLES, will tackle it once this goes through).

c2dce44 - docs: update example/deviceconfig_example.yaml <- dropped

4da60d3 - use "registry.suse.com" as the default base image registry if OS == "sles"
* although, user-specified `BaseImageRegistry` still takes precedence
* also extend tests in `internal/kmmodule/kmmodule_test.go` to test above changes in `resolveDockerfile` func

Test Plan
Test Result
truncated output of `make unit-test` after newly added tests in b625441
output from tests added as part of 4da60d3
Submission Checklist