docs: fix CRITICAL and MAJOR findings from E2E docs audit #283

Merged
ArangoGutierrez merged 5 commits into NVIDIA:main from ArangoGutierrez:fix/docs-e2e-audit-2026-04-07
Apr 7, 2026

@ArangoGutierrez
Collaborator

Summary

An end-to-end docs audit found 4 CRITICAL, 7 MAJOR, and 14 MINOR findings across all documentation paths. This PR fixes all CRITICAL and MAJOR findings, plus four MINOR ones (m1, m5, m8, m14).

Findings Fixed

CRITICAL

  • C1: GPU Operator Quick Start was not self-contained — inlined nvidia-container-toolkit install, CDI config, containerd restart
  • C2: tests/mocknvml/Dockerfile was broken: it copied only main.go and missed bridge_tests.go (added in PR #269, "test(bridge): add integration tests for bridge edge cases")
  • C3: with-fgo demo missing nodeSelector — DaemonSet ran everywhere instead of integration pool
  • C4: Inconsistent FGO topology schemas between demo and integration guide

MAJOR

  • M1: --stats counts wrong in development.md (corrected from 400/111 to 396/107)
  • M2: DRA prerequisite wrong: corrected from Kubernetes 1.31+ to 1.32+ (chart requires >=1.32.0)
  • M3: GPU Operator helm flags diverged from CI: removed the extra flags
  • M4: CI workflow_dispatch default Go version corrected from 1.23 to 1.25 (go.mod requires 1.25)
  • M5: Stale NGC credential claims — GPU Operator images are public
  • M6: Multi-node section expanded from snippet to complete 7-step Quick Start
  • M7: topology.yaml reference pointed to nonexistent file — converted to heredoc
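
A prerequisite such as the one in M2 can be checked mechanically before following the guide. The following is a minimal sketch, assuming `sort -V` (version sort, as in GNU coreutils) is available; the helper name `version_at_least` and the sample versions are illustrative and not part of this PR:

```shell
#!/bin/sh
# Succeed when version $1 is greater than or equal to minimum $2.
# Relies on `sort -V`: the smaller of the two versions sorts first,
# so the minimum must come out on top for the check to pass.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Sample versions standing in for a cluster's server version
# (strip any leading "v" before comparing).
for v in 1.31.4 1.32.0 1.33.1; do
    if version_at_least "$v" 1.32.0; then
        echo "$v: meets the DRA prerequisite"
    else
        echo "$v: too old for DRA"
    fi
done
```

The same helper would work for any other minimum stated in the docs, since it compares plain dotted version strings.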

MINOR (included)

  • m1: Project structure tree in development.md missing files
  • m5: tests/mocknvml/go.mod Go version bumped from 1.23 to 1.25
  • m8: Integration guide missing cleanup section
  • m14: Topology ConfigMap missing namespace

Files Changed (8)

  • tests/mocknvml/Dockerfile — Fix broken build (C2)
  • tests/mocknvml/go.mod — Go 1.23 to 1.25 (m5)
  • .github/workflows/nvml-mock-e2e.yaml — Default Go version + NGC comments (M4, M5)
  • tests/e2e/README.md — Remove stale NGC claims (M5)
  • deployments/nvml-mock/helm/nvml-mock/README.md — GPU Operator + multi-node guides (C1, M2, M3, M6)
  • docs/demo/with-fgo/README.md — nodeSelector, topology schema, heredoc (C3, C4, M7, m14)
  • docs/integrations/fake-gpu-operator.md — Cleanup section (m8)
  • docs/development.md — Stats + project tree (M1, m1)

- Dockerfile: copy all .go files (*.go) instead of only main.go,
  fixing build failure after bridge_tests.go was added in PR NVIDIA#269 (C2)
- go.mod: update Go version from 1.23 to 1.25 to match root module (m5)
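
Version drift like m5 is easy to reintroduce, so a small comparison of the `go` directives may be useful. This is a sketch only, assuming each module declares its version with a plain `go X.Y` line; the helper names and the throwaway demo files are hypothetical and do not exist in the repository:

```shell
#!/bin/sh
# Print the version from the first `go` directive in a go.mod file.
go_directive() {
    awk '$1 == "go" { print $2; exit }' "$1"
}

# Succeed only when two go.mod files declare the same Go version.
same_go_version() {
    [ "$(go_directive "$1")" = "$(go_directive "$2")" ]
}

# Demo with throwaway files standing in for the root go.mod and
# tests/mocknvml/go.mod as they looked before this fix.
demo_dir=$(mktemp -d)
printf 'module example.com/root\n\ngo 1.25\n' > "$demo_dir/root.mod"
printf 'module example.com/mocknvml\n\ngo 1.23\n' > "$demo_dir/mock.mod"

if same_go_version "$demo_dir/root.mod" "$demo_dir/mock.mod"; then
    echo "go versions match"
else
    echo "go version drift: $(go_directive "$demo_dir/root.mod") vs $(go_directive "$demo_dir/mock.mod")"
fi
```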

Found by docs E2E audit agents.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

- workflow_dispatch default golang_version: 1.23 → 1.25 (M4)
- Clarify GPU Operator images are public; standalone GFD/validator
  steps (if: false) may need auth for nvcr.io standalone images (M5)
- Update tests/e2e/README.md to remove stale NGC credential claims

Found by docs E2E audit agents.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

…ides

CRITICAL fixes:
- C1: GPU Operator section now self-contained — inlined nvidia-container-toolkit
  install, CDI mode configuration, and containerd restart steps that were
  previously behind a 'see the E2E workflow' reference

MAJOR fixes:
- M2: DRA prerequisites: Kubernetes 1.31+ → 1.32+ (matches chart constraint)
- M3: GPU Operator helm install: removed nfd.enabled=false and
  operator.defaultRuntime=containerd flags that diverge from CI
- M6: Multi-node section expanded from brief snippet to complete 7-step
  Quick Start (Kind create, image build, nvidia-ctk loop, helm installs
  with --wait --timeout, device plugin deploy, GPU verification with
  polling, cleanup)
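
The "GPU verification with polling" step in M6 can be sketched as a bounded retry loop. This is an illustration under assumptions, not the text of the guide: `poll_until` and `gpus_advertised` are hypothetical helper names, and the probe assumes nodes expose the `nvidia.com/gpu` allocatable resource once the device plugin is running:

```shell
#!/bin/sh
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: poll_until <attempts> <delay-seconds> <command> [args...]
poll_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only wait when another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Hypothetical probe: succeeds once at least one node advertises
# an allocatable nvidia.com/gpu quantity.
gpus_advertised() {
    count=$(kubectl get nodes \
        -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}' 2>/dev/null)
    [ -n "$count" ]
}

# Example against a live cluster (not run here):
#   poll_until 30 10 gpus_advertised || echo "GPUs never appeared"
```

Bounding the attempts keeps a copy-pasted Quick Start from hanging forever when a deployment fails.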

Found by docs E2E audit agents.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

CRITICAL fixes:
- C3: Added nodeSelector to helm install restricting DaemonSet to
  integration pool nodes instead of running everywhere
- C4: Aligned topology ConfigMap schema — both docs now use
  nodeGroups/gpuModel (was nodePools/gpuProfile in demo README)

MAJOR fixes:
- M7: Converted 'kubectl apply -f topology.yaml' (nonexistent file)
  to heredoc that creates the ConfigMap inline

MINOR fixes:
- m8: Added cleanup section to integration guide
- m14: Added namespace: gpu-operator to topology ConfigMap metadata
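
The heredoc approach from M7, with the namespace from m14 and the key names from C4, can be sketched as follows. Only `namespace: gpu-operator` and the `nodeGroups`/`gpuModel` key names come from this PR; the ConfigMap name, data key, group name, and GPU model are placeholders, and the real schema may nest these keys differently:

```shell
#!/bin/sh
# Print a topology ConfigMap manifest. The heredoc replaces a separate
# topology.yaml file, so the snippet has no hidden file dependency.
# ASSUMPTION: name, data key, group name, and GPU model are placeholders.
render_topology() {
    cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-topology
  namespace: gpu-operator
data:
  topology.yaml: |
    nodeGroups:
      example-group:
        gpuModel: EXAMPLE-GPU-MODEL
EOF
}

render_topology
# Against a live cluster, pipe it to kubectl instead:
#   render_topology | kubectl apply -f -
```

Quoting the delimiter (`<<'EOF'`) keeps the shell from expanding anything inside the manifest, which matters if a value ever contains `$`.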

Found by docs E2E audit agents.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

MAJOR fixes:
- M1: --stats output corrected from 400/111 to 396/107 (stubs 289 unchanged)

MINOR fixes:
- m1: Added missing files to project tree: bridge/events.go,
  engine/invalid_device.go, engine/version.go, config files,
  updated docs/ directory structure

Found by docs E2E audit agents.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review April 7, 2026 13:35
@ArangoGutierrez ArangoGutierrez enabled auto-merge (squash) April 7, 2026 14:16
@ArangoGutierrez ArangoGutierrez merged commit 3172c9f into NVIDIA:main Apr 7, 2026
18 checks passed