
ci(windows): retry submodule fetch on transient github.com 500s #5969

Open
Fedr wants to merge 2 commits into master from cicd/retry-windows-checkout



@Fedr Fedr commented Apr 23, 2026

Summary

Windows CI hit transient github.com submodule-clone failures, e.g. run 24845654039:

error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500
fatal: expected 'packfile'
fatal: clone of 'https://github.com/AcademySoftwareFoundation/openvdb' into submodule path 'thirdparty/openvdb/v9/openvdb' failed

Six third-party submodules failed this way within ~25 s (openvdb, parallel-hashmap, tinygltf, tinyxml2, zlib-ng, openvdb/v10). All of them were HTTP 500s returned by github.com itself, not an infrastructure issue on our side.

The actions/checkout@v6 step already retries, and the log does show Failed to clone 'thirdparty/openvdb/v9/openvdb'. Retry scheduled (a message printed by git's own submodule update). However, each submodule is retried only once, and the second attempt hit the same 500 because github.com was still flaky.

Change

Replace submodules: true in actions/checkout with an explicit retry loop in the next step:

- name: Checkout
  uses: actions/checkout@v6
  with:
    submodules: false

- name: Checkout submodules (with retries)
  shell: bash
  run: |
    for i in 1 2 3; do
      if git -c protocol.version=2 submodule update --init --force --depth=1; then
        break
      fi
      if [ "$i" = "3" ]; then
        echo "::error::submodule update failed after 3 attempts"
        exit 1
      fi
      echo "::warning::submodule update attempt $i failed — retrying in 30s"
      sleep 30
    done

The loop makes up to three attempts and sleeps 30 s after each failed one (two waits, so up to 60 s of back-off in total), which is enough to ride out the ~25 s github.com 500 spike observed in the failing run. --force recovers from any partial clone left by an earlier attempt. --depth=1 and -c protocol.version=2 match the flags actions/checkout itself uses internally.
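The same pattern can be factored into a small helper if the attempt count or delay ever needs tuning. This is only a sketch and not part of this PR: the retry function name and its argument order (max attempts, delay in seconds, then the command) are hypothetical, and it is written in plain POSIX shell so it can be tried outside of Actions.

```shell
# Hypothetical helper (not in this PR): retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while :; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Give up once the last allowed attempt has failed.
    if [ "$retry_n" -ge "$retry_max" ]; then
      echo "::error::command failed after $retry_max attempts" >&2
      return 1
    fi
    echo "::warning::attempt $retry_n failed; retrying in ${retry_delay}s" >&2
    sleep "$retry_delay"
    retry_n=$((retry_n + 1))
  done
}
```

In the workflow step it would be called as retry 3 30 git -c protocol.version=2 submodule update --init --force --depth=1, which behaves like the inline loop above. For a single call site the inline loop is arguably easier to read, which is why the PR keeps it.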

Why not Wandalen/wretry.action

An earlier attempt on this PR wrapped the whole actions/checkout step in Wandalen/wretry.action@v3.8.0. It triggered a startup_failure: the workflow parser refused to schedule any job. The root cause appears to be that wretry.action's outer composite-action layer dispatches to an inner _js_action that runs on node20, while actions/checkout@v6 uses node24. The handoff doesn't work, and the whole workflow is rejected before any step runs. wretry.action is thinly maintained and has an open issue (#193) with no ETA for a fix, so a simple in-workflow retry is the right trade.

Scope

Only the checkout steps in build-test-windows.yml change. Other platforms' workflows aren't touched; disable-build-* labels suppress non-Windows CI in this PR.

Test plan

  • Windows CI on this PR completes normally (attempt 1 succeeds, so no retry warnings are emitted).
  • If a future CI run hits a github.com 500 during submodule fetch, the retries kick in and the job still succeeds, provided the outage clears within the retry window.
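The retry path can't be triggered on demand in CI, so one way to check the loop's control flow is a local dry run. The sketch below copies the loop from the Change section but makes two substitutions that are assumptions for illustration only: git is replaced by a hypothetical fake_git stub that fails its first FAIL_TIMES calls, and the 30 s sleep is shortened to 0.

```shell
# Hypothetical stand-in for git: fails the first FAIL_TIMES calls, then succeeds.
calls=0
fake_git() {
  calls=$((calls + 1))
  [ "$calls" -gt "$FAIL_TIMES" ]
}

# Same structure as the workflow loop, with fake_git and a zero-length sleep.
run_loop() {
  for i in 1 2 3; do
    if fake_git -c protocol.version=2 submodule update --init --force --depth=1; then
      echo "succeeded on attempt $i"
      return 0
    fi
    if [ "$i" = "3" ]; then
      echo "failed after 3 attempts"
      return 1
    fi
    sleep 0  # the workflow sleeps 30 s here
  done
}
```

With FAIL_TIMES=2 and calls=0, run_loop prints succeeded on attempt 3 and returns 0; with FAIL_TIMES=3 it prints failed after 3 attempts and returns 1, matching the exit 1 branch in the workflow.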

@Fedr Fedr changed the title from "ci(windows): retry Checkout via Wandalen/wretry.action" to "ci(windows): retry submodule fetch on transient github.com 500s" on Apr 23, 2026