Skip to content

Fix nil pointer panic in upgrade controller when DrainSpec is not set#2418

Merged
tariq1890 merged 1 commit intoNVIDIA:mainfrom
harche:fix/upgrade-controller-nil-drainspec
May 4, 2026
Merged

Fix nil pointer panic in upgrade controller when DrainSpec is not set#2418
tariq1890 merged 1 commit intoNVIDIA:mainfrom
harche:fix/upgrade-controller-nil-drainspec

Conversation

@harche
Copy link
Copy Markdown
Contributor

@harche harche commented May 1, 2026

Summary

  • Adds a nil check for DrainSpec in the upgrade controller before accessing PodSelector
  • When DrainSpec is nil, initializes it with an empty DrainSpec{} so the default PodSelector is applied
  • Adds unit tests covering nil, empty, and pre-existing PodSelector cases

Root cause

DriverUpgradePolicySpec.DrainSpec is a pointer (*DrainSpec, json:"drain,omitempty"). When UpgradePolicy.AutoUpgrade is true but no drain section is specified in the ClusterPolicy, DrainSpec is nil. The code at upgrade_controller.go:172 dereferences DrainSpec.PodSelector without a nil check, causing a panic:

panic: runtime error: invalid memory address or nil pointer dereference
  controllers.(*UpgradeReconciler).Reconcile
    /workspace/controllers/upgrade_controller.go:172

Verified on cluster

  • Environment: OpenShift 4.21.5 / GCP, GPU Operator v26.3.1 with NVIDIA T4
  • Before fix: autoUpgrade: true without drain section → upgrade controller panic loop (1284 panics observed)
  • After fix: Same config → zero panics, upgrade controller reconciles cleanly

Fixes: #2417

Test plan

  • Unit tests pass (TestSetDrainSpecPodSelector — nil, empty, existing PodSelector)
  • go build ./controllers/ compiles
  • Verified on OpenShift 4.21.5 cluster with autoUpgrade: true and no drain section — zero panics

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@harche harche marked this pull request as ready for review May 1, 2026 15:54
@harche harche force-pushed the fix/upgrade-controller-nil-drainspec branch from 9f17c4c to 4d8fc92 Compare May 1, 2026 15:55
@rahulait
Copy link
Copy Markdown
Contributor

rahulait commented May 1, 2026

/ok to test 4d8fc92

@rahulait
Copy link
Copy Markdown
Contributor

rahulait commented May 4, 2026

LGTM. Can you please rebase your PR from latest main. We had some CI fixes which are required to run all the tests.

@harche harche force-pushed the fix/upgrade-controller-nil-drainspec branch from 4d8fc92 to b9b4372 Compare May 4, 2026 16:51
@rahulait
Copy link
Copy Markdown
Contributor

rahulait commented May 4, 2026

/ok to test b9b4372

Comment thread controllers/upgrade_controller_test.go Outdated
Copy link
Copy Markdown
Contributor

@tariq1890 tariq1890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @harche for your contribution. Requesting changes in the copyright header used

When UpgradePolicy.AutoUpgrade is true but no drain section is specified
in the ClusterPolicy, the upgrade controller panics at
upgrade_controller.go:172 due to a nil DrainSpec dereference.

Add a nil check for DrainSpec and initialize it with defaults when not
set, consistent with how UpgradePolicy itself is handled on line 160.

Fixes: NVIDIA#2417

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Harshal Patil <12152047+harche@users.noreply.github.com>
@harche harche force-pushed the fix/upgrade-controller-nil-drainspec branch from 8728bdc to 03abb74 Compare May 4, 2026 18:53
@tariq1890
Copy link
Copy Markdown
Contributor

/ok to test 03abb74

@tariq1890 tariq1890 merged commit eb8cd3c into NVIDIA:main May 4, 2026
18 of 19 checks passed
@harche harche deleted the fix/upgrade-controller-nil-drainspec branch May 7, 2026 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Nil pointer panic in upgrade controller when DrainSpec is not set in ClusterPolicy

3 participants