-
Notifications
You must be signed in to change notification settings - Fork 647
Update nightly uv pipeline #1493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
coreyjadams
wants to merge
35
commits into
main
Choose a base branch
from
update-nightly-uv-pipeline
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
35 commits
Select commit
Hold shift + click to select a range
6583605
Updating the github nightly build with uv to get optional deps better.
coreyjadams e69c502
Ensure transformer engine is skipped in CI build until cuda13 fix com…
coreyjadams e25278f
Include hidden files in ci venv uploads
coreyjadams af7fb76
Use a custom action for the install steps of the package.
coreyjadams d889254
Try adding an env variable to use headless pyvista off screen
coreyjadams 23b3caa
Skip pv pplotter errors
coreyjadams 715d685
Update CI
coreyjadams 3516338
Attempt to fix torch scatter build
coreyjadams d95548e
Try again with pyg
coreyjadams c8daf61
Add cmake. try again
coreyjadams 6dbb34d
Testing another way
coreyjadams 505e069
Add debuging options
coreyjadams 4182b8d
Add a dockerfile build action. Switch to cuda 12
coreyjadams f1682df
make the cache pull robust
coreyjadams 0119e91
Trying again
coreyjadams 81f573f
Trying again again
coreyjadams 81d0f1b
trying again again again
coreyjadams 37a6c64
Trying again again again again
coreyjadams 14df7eb
Trying again again again again again
coreyjadams 0926551
try more agains
coreyjadams 457f0de
who knows
coreyjadams b32b79a
Increase test tolerance. Upload test report as artifact
coreyjadams c40cb25
Turn off 3D convnd test, it's not numerically stable
coreyjadams f9c08a6
upload better report.
coreyjadams 77f8427
fix workspace permissions
coreyjadams ff76b55
revert workspace changes, upload reports for coverage path and genera…
coreyjadams 3ffbe21
Restore container pipeline against main.
coreyjadams d341cc6
rmove container build action from this pr
coreyjadams bcc279f
reintroduce pytorch-g deps on torch.
coreyjadams 3dd1f22
updates in this pr:
coreyjadams e1eef5c
add explicit git repo
coreyjadams 0ded62e
Update contracts in github nightly uv pipeline to ensure we build and…
coreyjadams 4006806
Set default shell to bash
coreyjadams f94e38b
Update PR workflow, add cache contract update.
coreyjadams cd3f501
Make sure CI PR doesn't run automatically
coreyjadams File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,177 @@ | ||
| # Nightly UV Cache Contract | ||
|
|
||
| This document is the authoritative reference for how the | ||
| `Nightly Github UV Workflow` | ||
| ([.github/workflows/github-nightly-uv.yml](workflows/github-nightly-uv.yml)) | ||
| publishes Python environment caches and how downstream PR workflows must | ||
| consume them. PR gating relies on these contracts being honored on both | ||
| the producer and consumer side; do not weaken them without updating this | ||
| document. | ||
|
|
||
| ## Two caches, two contracts | ||
|
|
||
| The pipeline maintains two strictly disjoint caches with different | ||
| invalidation rules. Conflating them is the historical bug class this | ||
| design exists to forbid. | ||
|
|
||
| ### A. uv download cache (`~/.cache/uv`) | ||
|
|
||
| | Property | Value | | ||
| |---|---| | ||
| | Key | `<UV_CACHE_KEY_PREFIX>-latest` | | ||
| | Prefix encodes | container image + Python version + uv version | | ||
| | Suffix | literal `latest` (mutable slot, refreshed via delete-before-save) | | ||
| | Contents | every wheel uv has ever downloaded for this baseline; additive across lockfile changes | | ||
| | Invalidates when | container image, CUDA version, Python version, or uv version changes (prefix change → new slot) | | ||
| | Does **not** invalidate on | `uv.lock` or `pyproject.toml` changes | | ||
| | Restore semantics | **fail-open**; missing cache only costs download time, never correctness | | ||
| | Save semantics | always save (delete the existing entry first, then save, then verify with `gh cache list`) | | ||
|
|
||
| The uv download cache is purely a speed optimisation. Anything that | ||
| correctness depends on must come from the venv cache. | ||
|
|
||
| ### B. venv cache (`.venv`) | ||
|
|
||
| | Property | Value | | ||
| |---|---| | ||
| | Key | `<VENV_CACHE_KEY_PREFIX>-<lockhash>` | | ||
| | Prefix encodes | container image + Python version + uv version + extras tag (e.g. `cu12`) | | ||
| | Suffix | `hashFiles('uv.lock', 'pyproject.toml')`, computed once per job and propagated via `needs.<job>.outputs.lockhash` | | ||
| | Contents | the fully realized `.venv` produced by `uv sync --frozen --group dev --extra <tag>` against the committed lockfile | | ||
| | Invalidates when | any prefix component changes, or the lockfile hash changes | | ||
| | Restore semantics | **exact-match only, no `restore-keys` fallback** | | ||
| | Save semantics | standard `actions/cache/save`. Same lockhash → same content → save no-ops, which is correct | | ||
|
|
||
| The extras tag (`cu12`, `cu13`, ...) is part of the prefix so cu12 and | ||
| cu13 builds never overwrite each other. | ||
|
|
||
| The lockhash includes both `uv.lock` and `pyproject.toml`. If a PR | ||
| touches `pyproject.toml` without regenerating `uv.lock`, the build fails | ||
| loudly during `uv sync --frozen` rather than silently producing a | ||
| mismatched venv. | ||
|
|
||
| ## PR consumer contracts | ||
|
|
||
| PR workflows that gate on the nightly venv MUST implement one of two | ||
| exhaustive paths. | ||
|
|
||
| ### Contract 1 — PR does not touch `pyproject.toml` or `uv.lock` | ||
|
|
||
| The PR's lockhash equals the hash that the most recent successful nightly | ||
| saved under. | ||
|
|
||
| ```yaml | ||
| - name: Restore uv download cache (fail-open) | ||
| uses: actions/cache/restore@v4 | ||
| with: | ||
| path: ~/.cache/uv | ||
| key: <UV_CACHE_KEY_PREFIX>-latest | ||
| # NO fail-on-cache-miss. Missing uv cache is acceptable here. | ||
|
|
||
| - name: Restore venv cache (exact match, MUST hit) | ||
| uses: actions/cache/restore@v4 | ||
| with: | ||
| path: .venv | ||
| key: <VENV_CACHE_KEY_PREFIX>-${{ hashFiles('uv.lock', 'pyproject.toml') }} | ||
| fail-on-cache-miss: true | ||
| # NO restore-keys. A partial match would silently degrade test | ||
| # validity. | ||
|
|
||
| - name: Use the env, read-only | ||
| env: | ||
| UV_FROZEN: "1" | ||
| UV_NO_SYNC: "1" | ||
| run: | | ||
| .venv/bin/python -c "import torch; print(torch.__version__)" | ||
| uv run --no-sync python -m pytest ... | ||
| ``` | ||
|
|
||
| Guarantees: | ||
|
|
||
| - Either the venv is byte-identical to what nightly validated, or the | ||
| job fails on cache miss. | ||
| - `UV_FROZEN=1` and `UV_NO_SYNC=1` (plus the `--no-sync` flags) make it | ||
| impossible for any subsequent `uv run` to mutate the restored venv. | ||
| - `physicsnemo` itself is installed editable, so the PR's source code | ||
| changes are picked up without rebuilding the venv. | ||
|
|
||
| ### Contract 2 — PR updates `pyproject.toml` and/or `uv.lock` | ||
|
|
||
| The PR's lockhash is new; the venv cache misses by design. | ||
|
|
||
| ```yaml | ||
| - name: Restore uv download cache (fail-open) | ||
| uses: actions/cache/restore@v4 | ||
| with: | ||
| path: ~/.cache/uv | ||
| key: <UV_CACHE_KEY_PREFIX>-latest | ||
|
|
||
| - name: Restore venv cache (will miss; that is fine) | ||
| id: venv-restore | ||
| uses: actions/cache/restore@v4 | ||
| with: | ||
| path: .venv | ||
| key: <VENV_CACHE_KEY_PREFIX>-${{ hashFiles('uv.lock', 'pyproject.toml') }} | ||
| # No fail-on-cache-miss; we expect to miss on lock-change PRs. | ||
|
|
||
| - name: Clean-build venv | ||
| if: steps.venv-restore.outputs.cache-hit != 'true' | ||
| env: | ||
| UV_LINK_MODE: copy | ||
| UV_FROZEN: "1" | ||
| UV_NO_SYNC: "1" | ||
| run: | | ||
| rm -rf .venv | ||
| uv sync --frozen --group dev --extra cu12 | ||
| # Optional: assert the lockfile was not mutated, e.g. with sha256sum | ||
| # before/after. setup-uv-env does this automatically. | ||
| ``` | ||
|
|
||
| Guarantees: | ||
|
|
||
| - `rm -rf .venv` ensures no leftover state from a previous PR push or a | ||
| partial restore-keys hit. (Restore-keys is forbidden anyway, but this | ||
| is cheap insurance.) | ||
| - `--frozen` + `UV_FROZEN=1` ensure the resolver cannot rewrite | ||
| `uv.lock` in CI. If the PR shipped a stale lock, the job fails fast | ||
| with a clear error rather than papering over the mismatch. | ||
| - The uv download cache (restored fail-open) supplies most wheels, so | ||
| the rebuild is fast even though the venv itself is fresh. | ||
|
|
||
| ## Operational notes | ||
|
|
||
| - **Concurrency**: the nightly workflow declares | ||
| `concurrency: nightly-github-uv` with `cancel-in-progress: false` so | ||
| two overlapping runs cannot race on the static `-latest` uv cache key. | ||
| - **Save verification**: after `actions/cache/save@v4` writes the uv | ||
| download cache slot, the workflow re-queries `gh cache list` to | ||
| confirm the entry exists. `cache/save` silently no-ops on key | ||
| collision; without verification a corrupted slot can persist for days. | ||
| - **Lockfile-mutation guard**: [.github/actions/setup-uv-env/action.yml](actions/setup-uv-env/action.yml) | ||
| snapshots `sha256(uv.lock)` and `sha256(pyproject.toml)` before any uv | ||
| command runs and compares them again at the end. Any drift (caused by | ||
| a forgotten `--frozen`, a dropped `--extra`, etc.) trips this guard | ||
| and fails the job with a pointed error message. | ||
| - **uv version pin**: `bootstrap-cudnn-ci` installs a pinned uv version | ||
| via `https://astral.sh/uv/<version>/install.sh` and asserts the | ||
| installed binary matches. The pin is what allows the uv version to | ||
| appear in the cache key prefix without surprise invalidations. | ||
|
|
||
| ## Bumping any of the baseline values | ||
|
|
||
| If you change the container image, CUDA version, Python version, uv | ||
| version, or extras tag, you must update both: | ||
|
|
||
| 1. The matching `env:` value at the top of both | ||
| [.github/workflows/github-nightly-uv.yml](workflows/github-nightly-uv.yml) | ||
| and | ||
| [.github/workflows/github-pr.yml](workflows/github-pr.yml). | ||
| 2. The corresponding literal embedded in `UV_CACHE_KEY_PREFIX` and | ||
| `VENV_CACHE_KEY_PREFIX` (GitHub Actions does not support env-to-env | ||
| references within the same `env:` block, so these are kept in | ||
| lockstep manually). | ||
|
|
||
| The first nightly run after a baseline bump will miss both caches, do a | ||
| full rebuild, and republish under the new prefix. Existing PR workflows | ||
| that pin to the old prefix will hard-fail until they are updated, which | ||
| is the desired behaviour. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
| name: Bootstrap cuDNN CI container | ||
| description: Install OS dependencies and uv in CUDA cuDNN container jobs | ||
| inputs: | ||
| python-version: | ||
| description: Python major.minor expected in the container | ||
| required: false | ||
| default: "3.12" | ||
| uv-version: | ||
| description: | | ||
| Exact uv version to install. Pinning is required because the uv | ||
| version is part of the cache key prefix (a surprise uv upgrade would | ||
| otherwise silently invalidate the wheel store). Bump in lockstep | ||
| with the workflow's UV_VERSION env value. | ||
| required: false | ||
| default: "0.11.7" | ||
| runs: | ||
| using: composite | ||
| steps: | ||
| - name: Install system dependencies | ||
| shell: bash | ||
| run: | | ||
| set -euo pipefail | ||
| export DEBIAN_FRONTEND=noninteractive | ||
| apt-get update | ||
| apt-get install -y --no-install-recommends \ | ||
| ca-certificates \ | ||
| curl \ | ||
| git \ | ||
| gh \ | ||
| build-essential \ | ||
| cmake \ | ||
| pkg-config \ | ||
| python3 \ | ||
| python3-dev \ | ||
| python3-venv \ | ||
| python3-pip \ | ||
| zstd | ||
| ln -sf /usr/bin/python3 /usr/bin/python | ||
| rm -rf /var/lib/apt/lists/* | ||
|
|
||
| - name: Install uv (pinned) | ||
| shell: bash | ||
| env: | ||
| UV_VERSION: ${{ inputs.uv-version }} | ||
| run: | | ||
| set -euo pipefail | ||
| # Pinned installer URL: https://astral.sh/uv/<version>/install.sh | ||
| # ensures the installed binary version matches the cache-key tag. | ||
| curl -LsSf "https://astral.sh/uv/${UV_VERSION}/install.sh" | sh | ||
| # Modern uv (>= 0.5) installs to ~/.local/bin, not ~/.cargo/bin. | ||
| echo "$HOME/.local/bin" >> "$GITHUB_PATH" | ||
|
|
||
| - name: Print toolchain versions | ||
| shell: bash | ||
| env: | ||
| EXPECTED_UV_VERSION: ${{ inputs.uv-version }} | ||
| run: | | ||
| set -euo pipefail | ||
| python3 --version | ||
| uv --version | ||
| # Hard-fail if the installed uv does not match the pin -- otherwise | ||
| # the cache prefix would lie about which uv produced the wheels. | ||
| actual_uv="$(uv --version | awk '{print $2}')" | ||
| if [ "$actual_uv" != "$EXPECTED_UV_VERSION" ]; then | ||
| echo "::error::uv version mismatch: expected ${EXPECTED_UV_VERSION}, got ${actual_uv}" | ||
| exit 1 | ||
| fi | ||
| gcc --version | head -n 1 | ||
| cmake --version | head -n 1 | ||
|
coreyjadams marked this conversation as resolved.
|
||
| if command -v nvcc >/dev/null 2>&1; then | ||
| nvcc --version | ||
| else | ||
| echo "nvcc not found on PATH" | ||
| fi | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question, does this test trigger any packages to get built from source or is this like core deps and we have binaries for everything?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context, just curious about if this is all the deps needed to install everything (and build deps from source when needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does trigger some deps to build from source, sometimes. torch_geometric and family, natten both come to mind. transformer_engine will be in the pile eventually too.
It does not do it all the time though: the uv caching will, since uv itself will cache the binaries locally, have the pre-built wheel from last night available tonight, if that makes sense. And the next night, and the next night, and so on until the the cache is invalid or the lock file requires a new build. So the build doesn't trigger everything all the time.
The first build took forever though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And no we're not catching everything yet. I need to get TE for sure, still missing a few others I think. I've got a lot of the deps, though. I was hoping to roll out incrementally from here - a part of the reporting stage was to help ID which tests are skipped due to missing software deps and fix that.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah its the builds form packages from source is what is raising my previous questions about cache invalidation / refreshes.
This is where leaning hard on caching and aggressively using it (and not refreshing it all the time) is very very useful. Initial cache builds for e2s can take hours with e2grid, natten, flash-attn, torch-harmoncis, etc can take hours... something that is useful practically to only do maybe at most once per week imo
managing the cache is uvs problem right. it does not need a fresh cache all the time. creating a new cache really just checks pypi is still alive (I hope so) and source builds still operate as intended. Could be nightly... but when things go wrong... you dont want all new PRs to get stuck building new caches as well because the lock file changed and now your having an issue with TE or other source build