Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
6583605
Updating the github nightly build with uv to get optional deps better.
coreyjadams Feb 19, 2026
e69c502
Ensure transformer engine is skipped in CI build until cuda13 fix com…
coreyjadams Feb 19, 2026
e25278f
Include hidden files in ci venv uploads
coreyjadams Feb 19, 2026
af7fb76
Use a custom action for the install steps of the package.
coreyjadams Feb 19, 2026
d889254
Try adding an env variable to use headless pyvista off screen
coreyjadams Feb 19, 2026
23b3caa
Skip pv pplotter errors
coreyjadams Feb 19, 2026
715d685
Update CI
coreyjadams Feb 20, 2026
3516338
Attempt to fix torch scatter build
coreyjadams Apr 9, 2026
d95548e
Try again with pyg
coreyjadams Feb 20, 2026
c8daf61
Add cmake. try again
coreyjadams Feb 20, 2026
6dbb34d
Testing another way
coreyjadams Feb 20, 2026
505e069
Add debuging options
coreyjadams Feb 20, 2026
4182b8d
Add a dockerfile build action. Switch to cuda 12
coreyjadams Feb 20, 2026
f1682df
make the cache pull robust
coreyjadams Feb 20, 2026
0119e91
Trying again
coreyjadams Feb 20, 2026
81f573f
Trying again again
coreyjadams Feb 20, 2026
81d0f1b
trying again again again
coreyjadams Feb 20, 2026
37a6c64
Trying again again again again
coreyjadams Feb 20, 2026
14df7eb
Trying again again again again again
coreyjadams Feb 21, 2026
0926551
try more agains
coreyjadams Feb 21, 2026
457f0de
who knows
coreyjadams Feb 21, 2026
b32b79a
Increase test tolerance. Upload test report as artifact
coreyjadams Feb 23, 2026
c40cb25
Turn off 3D convnd test, it's not numerically stable
coreyjadams Feb 23, 2026
f9c08a6
upload better report.
coreyjadams Feb 24, 2026
77f8427
fix workspace permissions
coreyjadams Feb 24, 2026
ff76b55
revert workspace changes, upload reports for coverage path and genera…
coreyjadams Feb 24, 2026
3ffbe21
Restore container pipeline against main.
coreyjadams Mar 12, 2026
d341cc6
rmove container build action from this pr
coreyjadams Mar 12, 2026
bcc279f
reintroduce pytorch-g deps on torch.
coreyjadams Mar 12, 2026
3dd1f22
updates in this pr:
coreyjadams Mar 17, 2026
e1eef5c
add explicit git repo
coreyjadams Mar 17, 2026
0ded62e
Update contracts in github nightly uv pipeline to ensure we build and…
coreyjadams Apr 23, 2026
4006806
Set default shell to bash
coreyjadams Apr 23, 2026
f94e38b
Update PR workflow, add cache contract update.
coreyjadams Apr 23, 2026
cd3f501
Make sure CI PR doesn't run automatically
coreyjadams Apr 23, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
177 changes: 177 additions & 0 deletions .github/CACHE_CONTRACT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,177 @@
# Nightly UV Cache Contract

This document is the authoritative reference for how the
`Nightly Github UV Workflow`
([.github/workflows/github-nightly-uv.yml](workflows/github-nightly-uv.yml))
publishes Python environment caches and how downstream PR workflows must
consume them. PR gating relies on these contracts being honored on both
the producer and consumer side; do not weaken them without updating this
document.

## Two caches, two contracts

The pipeline maintains two strictly disjoint caches with different
invalidation rules. Conflating them is the historical bug class this
design exists to forbid.

### A. uv download cache (`~/.cache/uv`)

| Property | Value |
|---|---|
| Key | `<UV_CACHE_KEY_PREFIX>-latest` |
| Prefix encodes | container image + Python version + uv version |
| Suffix | literal `latest` (mutable slot, refreshed via delete-before-save) |
| Contents | every wheel uv has ever downloaded for this baseline; additive across lockfile changes |
| Invalidates when | container image, CUDA version, Python version, or uv version changes (prefix change → new slot) |
| Does **not** invalidate on | `uv.lock` or `pyproject.toml` changes |
| Restore semantics | **fail-open**; missing cache only costs download time, never correctness |
| Save semantics | always save (delete the existing entry first, then save, then verify with `gh cache list`) |

The uv download cache is purely a speed optimisation. Anything that
correctness depends on must come from the venv cache.

### B. venv cache (`.venv`)

| Property | Value |
|---|---|
| Key | `<VENV_CACHE_KEY_PREFIX>-<lockhash>` |
| Prefix encodes | container image + Python version + uv version + extras tag (e.g. `cu12`) |
| Suffix | `hashFiles('uv.lock', 'pyproject.toml')`, computed once per job and propagated via `needs.<job>.outputs.lockhash` |
| Contents | the fully realized `.venv` produced by `uv sync --frozen --group dev --extra <tag>` against the committed lockfile |
| Invalidates when | any prefix component changes, or the lockfile hash changes |
| Restore semantics | **exact-match only, no `restore-keys` fallback** |
| Save semantics | standard `actions/cache/save`. Same lockhash → same content → save no-ops, which is correct |

The extras tag (`cu12`, `cu13`, ...) is part of the prefix so cu12 and
cu13 builds never overwrite each other.

The lockhash includes both `uv.lock` and `pyproject.toml`. If a PR
touches `pyproject.toml` without regenerating `uv.lock`, the build fails
loudly during `uv sync --frozen` rather than silently producing a
mismatched venv.

## PR consumer contracts

PR workflows that gate on the nightly venv MUST implement one of two
exhaustive paths.

### Contract 1 — PR does not touch `pyproject.toml` or `uv.lock`

The PR's lockhash equals the hash that the most recent successful nightly
saved under.

```yaml
- name: Restore uv download cache (fail-open)
uses: actions/cache/restore@v4
with:
path: ~/.cache/uv
key: <UV_CACHE_KEY_PREFIX>-latest
# NO fail-on-cache-miss. Missing uv cache is acceptable here.

- name: Restore venv cache (exact match, MUST hit)
uses: actions/cache/restore@v4
with:
path: .venv
key: <VENV_CACHE_KEY_PREFIX>-${{ hashFiles('uv.lock', 'pyproject.toml') }}
fail-on-cache-miss: true
# NO restore-keys. A partial match would silently degrade test
# validity.

- name: Use the env, read-only
env:
UV_FROZEN: "1"
UV_NO_SYNC: "1"
run: |
.venv/bin/python -c "import torch; print(torch.__version__)"
uv run --no-sync python -m pytest ...
```

Guarantees:

- Either the venv is byte-identical to what nightly validated, or the
job fails on cache miss.
- `UV_FROZEN=1` and `UV_NO_SYNC=1` (plus the `--no-sync` flags) make it
impossible for any subsequent `uv run` to mutate the restored venv.
- `physicsnemo` itself is installed editable, so the PR's source code
changes are picked up without rebuilding the venv.

### Contract 2 — PR updates `pyproject.toml` and/or `uv.lock`

The PR's lockhash is new; the venv cache misses by design.

```yaml
- name: Restore uv download cache (fail-open)
uses: actions/cache/restore@v4
with:
path: ~/.cache/uv
key: <UV_CACHE_KEY_PREFIX>-latest

- name: Restore venv cache (will miss; that is fine)
id: venv-restore
uses: actions/cache/restore@v4
with:
path: .venv
key: <VENV_CACHE_KEY_PREFIX>-${{ hashFiles('uv.lock', 'pyproject.toml') }}
# No fail-on-cache-miss; we expect to miss on lock-change PRs.

- name: Clean-build venv
if: steps.venv-restore.outputs.cache-hit != 'true'
env:
UV_LINK_MODE: copy
UV_FROZEN: "1"
UV_NO_SYNC: "1"
run: |
rm -rf .venv
uv sync --frozen --group dev --extra cu12
# Optional: assert the lockfile was not mutated, e.g. with sha256sum
# before/after. setup-uv-env does this automatically.
```

Guarantees:

- `rm -rf .venv` ensures no leftover state from a previous PR push or a
partial restore-keys hit. (Restore-keys is forbidden anyway, but this
is cheap insurance.)
- `--frozen` + `UV_FROZEN=1` ensure the resolver cannot rewrite
`uv.lock` in CI. If the PR shipped a stale lock, the job fails fast
with a clear error rather than papering over the mismatch.
- The uv download cache (restored fail-open) supplies most wheels, so
the rebuild is fast even though the venv itself is fresh.

## Operational notes

- **Concurrency**: the nightly workflow declares
`concurrency: nightly-github-uv` with `cancel-in-progress: false` so
two overlapping runs cannot race on the static `-latest` uv cache key.
- **Save verification**: after `actions/cache/save@v4` writes the uv
download cache slot, the workflow re-queries `gh cache list` to
confirm the entry exists. `cache/save` silently no-ops on key
collision; without verification a corrupted slot can persist for days.
- **Lockfile-mutation guard**: [.github/actions/setup-uv-env/action.yml](actions/setup-uv-env/action.yml)
snapshots `sha256(uv.lock)` and `sha256(pyproject.toml)` before any uv
command runs and compares them again at the end. Any drift (caused by
a forgotten `--frozen`, a dropped `--extra`, etc.) trips this guard
and fails the job with a pointed error message.
- **uv version pin**: `bootstrap-cudnn-ci` installs a pinned uv version
via `https://astral.sh/uv/<version>/install.sh` and asserts the
installed binary matches. The pin is what allows the uv version to
appear in the cache key prefix without surprise invalidations.

## Bumping any of the baseline values

If you change the container image, CUDA version, Python version, uv
version, or extras tag, you must update both:

1. The matching `env:` value at the top of both
[.github/workflows/github-nightly-uv.yml](workflows/github-nightly-uv.yml)
and
[.github/workflows/github-pr.yml](workflows/github-pr.yml).
2. The corresponding literal embedded in `UV_CACHE_KEY_PREFIX` and
`VENV_CACHE_KEY_PREFIX` (GitHub Actions does not support env-to-env
references within the same `env:` block, so these are kept in
lockstep manually).

The first nightly run after a baseline bump will miss both caches, do a
full rebuild, and republish under the new prefix. Existing PR workflows
that pin to the old prefix will hard-fail until they are updated, which
is the desired behaviour.
74 changes: 74 additions & 0 deletions .github/actions/bootstrap-cudnn-ci/action.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
name: Bootstrap cuDNN CI container
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question, does this test trigger any packages to get built from source or is this like core deps and we have binaries for everything?

Copy link
Copy Markdown
Collaborator

@NickGeneva NickGeneva Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context, just curious about if this is all the deps needed to install everything (and build deps from source when needed)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does trigger some deps to build from source, sometimes. torch_geometric and family, natten both come to mind. transformer_engine will be in the pile eventually too.

It does not do it all the time though: the uv caching will, since uv itself will cache the binaries locally, have the pre-built wheel from last night available tonight, if that makes sense. And the next night, and the next night, and so on until the the cache is invalid or the lock file requires a new build. So the build doesn't trigger everything all the time.

The first build took forever though.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And no we're not catching everything yet. I need to get TE for sure, still missing a few others I think. I've got a lot of the deps, though. I was hoping to roll out incrementally from here - a part of the reporting stage was to help ID which tests are skipped due to missing software deps and fix that.

Copy link
Copy Markdown
Collaborator

@NickGeneva NickGeneva Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah its the builds form packages from source is what is raising my previous questions about cache invalidation / refreshes.

This is where leaning hard on caching and aggressively using it (and not refreshing it all the time) is very very useful. Initial cache builds for e2s can take hours with e2grid, natten, flash-attn, torch-harmoncis, etc can take hours... something that is useful practically to only do maybe at most once per week imo

managing the cache is uvs problem right. it does not need a fresh cache all the time. creating a new cache really just checks pypi is still alive (I hope so) and source builds still operate as intended. Could be nightly... but when things go wrong... you dont want all new PRs to get stuck building new caches as well because the lock file changed and now your having an issue with TE or other source build

description: Install OS dependencies and uv in CUDA cuDNN container jobs
inputs:
python-version:
description: Python major.minor expected in the container
required: false
default: "3.12"
uv-version:
description: |
Exact uv version to install. Pinning is required because the uv
version is part of the cache key prefix (a surprise uv upgrade would
otherwise silently invalidate the wheel store). Bump in lockstep
with the workflow's UV_VERSION env value.
required: false
default: "0.11.7"
runs:
using: composite
steps:
- name: Install system dependencies
shell: bash
run: |
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive
apt-get update
apt-get install -y --no-install-recommends \
ca-certificates \
curl \
git \
gh \
build-essential \
cmake \
pkg-config \
python3 \
python3-dev \
python3-venv \
python3-pip \
zstd
ln -sf /usr/bin/python3 /usr/bin/python
rm -rf /var/lib/apt/lists/*

- name: Install uv (pinned)
shell: bash
env:
UV_VERSION: ${{ inputs.uv-version }}
run: |
set -euo pipefail
# Pinned installer URL: https://astral.sh/uv/<version>/install.sh
# ensures the installed binary version matches the cache-key tag.
curl -LsSf "https://astral.sh/uv/${UV_VERSION}/install.sh" | sh
# Modern uv (>= 0.5) installs to ~/.local/bin, not ~/.cargo/bin.
echo "$HOME/.local/bin" >> "$GITHUB_PATH"

- name: Print toolchain versions
shell: bash
env:
EXPECTED_UV_VERSION: ${{ inputs.uv-version }}
run: |
set -euo pipefail
python3 --version
uv --version
# Hard-fail if the installed uv does not match the pin -- otherwise
# the cache prefix would lie about which uv produced the wheels.
actual_uv="$(uv --version | awk '{print $2}')"
if [ "$actual_uv" != "$EXPECTED_UV_VERSION" ]; then
echo "::error::uv version mismatch: expected ${EXPECTED_UV_VERSION}, got ${actual_uv}"
exit 1
fi
gcc --version | head -n 1
cmake --version | head -n 1
Comment thread
coreyjadams marked this conversation as resolved.
if command -v nvcc >/dev/null 2>&1; then
nvcc --version
else
echo "nvcc not found on PATH"
fi
Loading
Loading