ci(ubuntu22): retry the apt.kitware.com CMake install on transient failure #5965

Closed
Fedr wants to merge 1 commit into master from ci/kitware-cmake-retry


Fedr commented Apr 23, 2026

Summary

Keep using apt.kitware.com as the CMake source in docker/ubuntu22Dockerfile, but wrap the install block in a retry loop so a single connection-refused error doesn't fail the whole image build.

Why keep Kitware's repo rather than swap it

Ubuntu 22.04's apt-supplied CMake (3.22) is too old for MRCuda's CUDA_STANDARD 20 against NVCC 12.6: CMake's table of NVCC compile flags only gained a C++20 entry for recent CUDA toolkits in 3.25+. The "drop the upgrade entirely" experiment (#5963, closed) produced a failing run:

CMake Error in build/Release/CMakeFiles/CMakeTmp/CMakeLists.txt:
  Target "cmTC_*" requires the language dialect "CUDA20" (with compiler
  extensions), but CMake does not know the compile flags to use to enable it.
Call Stack:
  source/MRCuda/CMakeLists.txt:4 (project)

So we do need a newer CMake than Ubuntu jammy ships. Kitware's apt repo is the existing path, and this PR makes it durable against the transient outages we keep hitting.
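The version requirement above can be checked directly in a build script. The following is a minimal sketch, assuming purely numeric dotted versions; the helper name version_ge is hypothetical, and the 3.25 threshold is taken from the text above rather than from the project's build files.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted numeric version A >= B.
# Hypothetical helper; assumes every component is a plain integer
# (no "-rc1"-style suffixes).
version_ge() {
    _a=$1
    _b=$2
    while [ -n "$_a" ] || [ -n "$_b" ]; do
        # Take the leading component of each version; missing ones count as 0.
        _x=${_a%%.*}; _x=${_x:-0}
        _y=${_b%%.*}; _y=${_y:-0}
        if [ "$_x" -gt "$_y" ]; then return 0; fi
        if [ "$_x" -lt "$_y" ]; then return 1; fi
        # Drop the component just compared.
        case $_a in *.*) _a=${_a#*.} ;; *) _a= ;; esac
        case $_b in *.*) _b=${_b#*.} ;; *) _b= ;; esac
    done
    return 0
}

# Versions quoted in the text: jammy's 3.22 and noble's 3.28, against 3.25.
for v in 3.22 3.28; do
    if version_ge "$v" 3.25; then
        echo "CMake $v: new enough for CUDA_STANDARD 20"
    else
        echo "CMake $v: too old for CUDA_STANDARD 20"
    fi
done
```

On a real image the first argument would come from the installed tool, for example from the third field of the first line of cmake --version.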

Why retry

Connection-refused outages are typically minutes-long; a short retry loop covers them without requiring us to mirror the repo ourselves or swap providers.

The change

RUN set -eux; \
    attempt=1; \
    until ( set -e; \
        apt remove --purge --auto-remove -y cmake; \
        apt update; \
        apt install -y software-properties-common lsb-release; \
        apt clean all; \
        wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null; \
        apt-add-repository "deb https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main"; \
        apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 42D5A192B819C5DA; \
        apt update; \
        apt install -y kitware-archive-keyring; \
        rm -f /etc/apt/trusted.gpg.d/kitware.gpg; \
        apt update; \
        apt install -y cmake; \
    ); do \
        if [ $attempt -ge 5 ]; then \
            echo "ERROR: Kitware CMake install failed after $attempt attempts" >&2; \
            exit 1; \
        fi; \
        delay=$((attempt * 15)); \
        echo "WARN: Kitware CMake install attempt $attempt failed; retrying in ${delay}s" >&2; \
        sleep $delay; \
        attempt=$((attempt + 1)); \
    done

Retry shape

  • 5 attempts total. Total potential wait before giving up: ~2m30s (delays of 15s, 30s, 45s, 60s between attempts 1–5).
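  • Subshell with set -e around the install block is meant to make any single command failure abort the subshell and trigger a retry, so we don't continue past a failed apt update only to hit a cascading error later. Caveat: POSIX shells (dash and bash included) ignore -e inside the condition list of until, even within a subshell, so this only holds if the commands are chained with && or run in a separate sh -ec process.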
  • Each retry is logged with the attempt number and the next delay on stderr, so the reason for slow image builds is visible in the CI log (unlike --quiet-style silent suppression).
  • Hard failure after attempt 5 with a clear final message, not a silent infinite loop.
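The retry shape described in the list can be factored into a small helper. This is a sketch, not the code from the PR: the function name retry and its argument order are hypothetical. Because POSIX shells ignore set -e in the condition list of until, the usage below passes the block to a separate sh -ec process, where -e takes effect normally.

```shell
#!/bin/sh
# retry MAX BASE_DELAY CMD [ARGS...]   (hypothetical helper)
# Runs CMD until it succeeds, at most MAX times. After failed attempt N it
# sleeps N * BASE_DELAY seconds, so MAX=5 and BASE_DELAY=15 gives delays of
# 15s, 30s, 45s, 60s (150s in total) before the final failure.
retry() {
    max=$1
    base=$2
    shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "ERROR: '$*' failed after $attempt attempts" >&2
            return 1
        fi
        delay=$((attempt * base))
        echo "WARN: attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Demo with no waiting (BASE_DELAY=0): the fresh `sh -ec` process stops at the
# first failing command, so "unreachable" is never printed.
retry 2 0 sh -ec 'false; echo unreachable' || echo "gave up"
```

In a Dockerfile the same helper could be used as retry 5 15 sh -ec '<install commands>', which keeps the abort-on-first-failure behaviour the list describes.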

Idempotency of the retried block

All commands inside the subshell are safe to re-run on a partial-failure retry:

  • apt remove cmake — noop if cmake was already removed on the previous attempt.
  • apt-add-repository "deb ..." — idempotent; skips adding a duplicate source line.
  • apt-key adv --recv-keys ... — idempotent; skips if key already in the keyring.
  • apt install -y X — noop if X is already installed and up to date.
  • rm /etc/apt/trusted.gpg.d/kitware.gpg → rm -f (only substantive change beyond the wrapping): a previous failed attempt may have already removed the file, and the next attempt's rm without -f would then fail and abort the subshell under set -e.

No other command changes; order and flags are identical to the current recipe.
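The rm versus rm -f point is easy to reproduce outside the Dockerfile. A small sketch using a temporary stand-in file (the path is illustrative, not the real keyring location):

```shell
#!/bin/sh
# Stand-in for /etc/apt/trusted.gpg.d/kitware.gpg, created in a temp directory.
demo_dir=$(mktemp -d)
key="$demo_dir/kitware.gpg"
: > "$key"

# "First attempt": the file exists, so plain rm succeeds.
rm "$key"

# "Second attempt": the file is already gone.
rm_plain=0
rm "$key" 2>/dev/null || rm_plain=$?   # non-zero: would fail the retried block
rm_force=0
rm -f "$key" || rm_force=$?            # zero: missing file is not an error

echo "rm exit status: $rm_plain, rm -f exit status: $rm_force"
rmdir "$demo_dir"
```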

Scope

Only docker/ubuntu22Dockerfile. docker/ubuntu24Dockerfile doesn't upgrade CMake (Ubuntu 24.04 noble's apt ships CMake 3.28 which does have the NVCC 12.6 × C++20 flag table entry) and has no apt.kitware.com recipe to retry.

Test plan

  • prepare-image / linux-image-build-upload (ubuntu22, x64) succeeds
  • prepare-image / linux-image-build-upload (ubuntu22, arm64) succeeds — same recipe, same path
  • On success, step is no slower than before — the retry block exits after attempt 1 normally; set -u / set -e costs nothing
  • If apt.kitware.com is flaky during this very run, the retry messages show up in the log and the build still completes

🤖 Generated with Claude Code

…ilure

apt.kitware.com has been unreliable in ways our image builds keep
hitting. Most recent example: ubuntu22-arm64 image build on PR #5959
(run 24800811051) died at this step with

  Could not connect to apt.kitware.com:443 (66.194.253.25)
  - connect (111: Connection refused)
  ...
  E: Unable to locate package kitware-archive-keyring

while the x64 leg in the same run succeeded at the exact same command.
Dec 2024 had a full-day outage from a mis-issued SSL cert as well
(https://discourse.cmake.org/t/kitware-apt-repo-down/13184).

Keep using apt.kitware.com (it delivers a newer CMake than Ubuntu
22.04's apt, which MeshLib's MRCuda needs for CUDA20 dialect support
against NVCC 12.6) but wrap the whole block in a retry loop:

  - 5 attempts total
  - Backoff between retries: 15s, 30s, 45s, 60s (total potential wait
    before giving up: ~2m30s)
  - Log each retry with the attempt number and the delay so the cause
    is visible in the image-build log
  - Fail hard after attempt 5 with a clear message

All apt / apt-add-repository / apt-key / wget commands inside the
block are idempotent: re-running the block from scratch after a
partial failure is safe (apt remove cmake is a noop if cmake was
already removed on the previous attempt, apt-add-repository skips
adding a duplicate deb line, etc.). Only changes:

  - rm /etc/apt/trusted.gpg.d/kitware.gpg -> rm -f (in case a previous
    failed attempt already removed it)

Otherwise the commands and their order are unchanged.
Fedr commented Apr 23, 2026

Closing as superseded. #5963 merged the stronger fix: drop the apt.kitware.com step entirely and fall back to CMAKE_CUDA_STANDARD=17 on CMake < 3.25.2 (the only environment in our matrix where we actually needed the Kitware upgrade was Ubuntu 22.04's cmake 3.22 lacking the NVCC+C++20 flag-table entry). Retry-on-failure covers the symptom; the merged approach removes the Kitware dependency from ubuntu22 altogether, which is strictly better.

Fedr closed this Apr 23, 2026