
sync: upstream catch-up (133 commits, base 2026-05-26) + re-home fork delta after supervisor split #10

Open
vessux wants to merge 152 commits into main from sync/upstream-2026-06-18

vessux (Owner) commented Jun 18, 2026

Summary

Upstream-sync rebase: fork delta re-parented onto upstream/main (4848c409f23c2c8e, 133 commits absorbed). Per the fork-sync flow this PR is a review artifact, not a merge — it will show CONFLICTING because the rebased branch cannot fast-forward into main. Land via the force-push below.

17 fork-delta commits replayed + 1 adaptation commit. The stale "style: cargo fmt" commit was dropped (its changes were re-formatted in the adaptation pass).

Verification (macOS)

  • cargo build/clippy/test --workspace --features openshell-prover/bundled-z3: green, clippy zero warnings, all tests pass (722/723 + ignored gateway/podman integration tests).
  • Proto pre-flight clean: fork fields all in 9000+ (cred_inject=9000, echo=9001, trust_check=9002, volumes=9003, SANDBOX_PHASE_STOPPED=9004); zero upstream collisions in the 8000/9000 range.

Conflict surfaces resolved

Not landing via this PR

Force-push to main is the human step (default-branch protection):

git push origin sync/upstream-2026-06-18:main --force-with-lease

Post-merge: tag vX.Y.0 → release workflow builds binaries → bump OPENSHELL_FORK_TAG in openlock src/sandbox/fork-binaries.ts; smoke-test a fresh gateway on Mac/podman.

mesutoezdil and others added 30 commits May 26, 2026 09:06
…elm-k3s-local (NVIDIA#1539)

macOS ships bash 3.2 which lacks mapfile/readarray. Replace all three
occurrences in configure_ghcr_credentials, cluster_has_image, and
cluster_image_platform with a portable while-read loop, consistent
with the fix applied to docker-build-image.sh in NVIDIA#1334.
Signed-off-by: Ann Marie Fred <afred@redhat.com>
This makes it so you can run the dev gateway and sandbox with:

```
mise run gateway
# in another shell
mise run sandbox
```

Signed-off-by: Kris Hicks <khicks@nvidia.com>
… L4/L7 split (NVIDIA#1412)

* fix(sandbox): add mechanistic smoke test for L4 deny and document the L4/L7 split

The old smoke script exercised an L7 PUT which hung because the denial
aggregator is only wired to L4 CONNECT denies, not L7 enforcement.

Add mechanistic-smoke.sh which triggers an L4 deny, waits for the
aggregator to flush, and asserts a pending chunk appears under
openshell rule get --status pending.

Document the intentional L4-only scope of the mechanistic mapper in
architecture/sandbox.md.

Fixes NVIDIA#1333

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* refactor(smoke): remove redundant variable inits and merge double step call

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* fix(smoke): wire mechanistic smoke into mise and guard TMP_DIR

- Initialize TMP_DIR before trap to prevent unbound variable on early exit
- Add e2e:mechanistic-smoke mise task with gateway setup
- Document mechanistic smoke in policy-advisor README

* test(proxy): verify L4 deny enqueues a DenialEvent

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* fix(proxy): remove unnecessary path qualifications in L4 denial smoke test

---------

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
Signed-off-by: Kris Hicks <khicks@nvidia.com>
Signed-off-by: Kris Hicks <khicks@nvidia.com>
NVIDIA#1585)

On kernels without Landlock (e.g. gVisor's sentry returns ENOSYS for
syscall 444), the previous best_effort path still logged "Applying
Landlock" + "Landlock ruleset built" events even though no enforcement
was happening. Probe at the top of `landlock::prepare` and short-circuit
with a single High-severity "Sandbox Unavailable" finding.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Kris Hicks <khicks@nvidia.com>
Documents the ServiceAccount, Role, and ClusterRole created by the Helm
chart inline on the setup page, per reviewer feedback on NVIDIA#1250. Reflects
the current chart templates including pods/get for sandbox identity and
tokenreviews/create for projected token validation.

Closes NVIDIA#1018
Signed-off-by: mjamiv <142179942+mjamiv@users.noreply.github.com>
)

* feat(gateway): add readiness probe metrics and test-only store close

Emit Prometheus readiness metrics for database probes (healthy gauge and
outcome-labeled latency histogram) with coverage in health HTTP tests.
Restrict Store::close behind test support cfg to prevent accidental runtime
pool shutdown under live traffic.

Signed-off-by: Adrien Langou <alangou@nvidia.com>

* test(e2e): add simple e2e test with kubernetes to test /readyz

Signed-off-by: Adrien Langou <alangou@nvidia.com>

---------

Signed-off-by: Adrien Langou <alangou@nvidia.com>
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
* fix(cli): preserve symlinks during sandbox upload

* docs(sandboxes): document upload symlink behavior
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
… enforce at the router (NVIDIA#1596)

* feat(server): per-handler gRPC auth annotations

Move scope, role, and auth-mode metadata to the handler definition site
via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained
SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and
ALLOWED_SANDBOX_METHODS tables are now generated from per-method
annotations on the tonic service impls, with canonical gRPC paths
derived from the service name and method name.

Adds a new openshell-server-macros proc-macro crate, an aggregator in
auth/method_authz.rs, and an exhaustiveness test that decodes the
protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and
verifies every RPC has an annotation.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server): rename `sandbox-secret` auth mode to `sandbox`

PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox
gateway-minted JWTs. A handler marked `sandbox` now authenticates as a
specific `Principal::Sandbox`, not as a holder of a shared credential.

Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and
`AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches
the post-NVIDIA#1404 identity model.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* fix(server): enforce per-handler AuthMode at the router

Addresses review feedback on the per-handler auth-annotation work.

- Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous
  router only checked is_sandbox_callable() for Principal::Sandbox; user
  principals still flowed into AuthzPolicy::check() and bypassed the
  per-handler declaration. A user with `openshell:all` could therefore
  reach `sandbox`-only handlers like GetSandboxProviderEnvironment,
  ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even
  though their annotations said sandbox-only. Adds an
  is_user_callable() predicate and rejects User principals at the
  router for `sandbox` / `unauthenticated` methods.

- Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A
  second `auth`, `scope`, or `role` previously silently overwrote the
  first value; now it fails to compile.

- Regression tests: a unit test for is_user_callable() and a router
  test that proves a user with admin role + openshell:all cannot reach
  the nine sandbox-only handlers.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server-macros): drop standalone `rpc_auth` stub

The stub was a safety net that fired only when a method had
`#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it
required `rpc_auth` to be imported, which is why both call sites carried
`#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`.

Drop the stub and the unused-import workaround. A missing `#[rpc_authz]`
now surfaces as rustc's standard "cannot find attribute `rpc_auth` in
this scope" — clear enough, and one fewer import + lint exception.

Addresses review comment on PR NVIDIA#1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server-macros): emit fixed `AUTH_METADATA` const per service

The previous trait-derived const name turned `OpenShell` into
`OPEN_SHELL_AUTH_METADATA`, splitting the project name across an
underscore. Each impl already lives in its own module
(`crate::grpc::`, `crate::inference::`), so the module path is enough
to disambiguate between services — a fixed `AUTH_METADATA` name reads
more naturally.

Aggregator in `auth/method_authz.rs` now references
`crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA`
directly.

Addresses review comment on PR NVIDIA#1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment

OpenShell is one word; reference name in the doc should be
OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA.

Addresses review nit on PR NVIDIA#1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

---------

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
…IA#1615)

- Extract landing-page terminal demo into a reusable <CommandTerminal />
  component with inline styles (no global CSS dependency)
- Animate a second command line cycling through claude/opencode/codex
  via @keyframes scoped inside the component
- Inline BadgeLinks layout styles so the component renders correctly
  without relying on .badge-links from main.css
- Add jsx.d.ts shim so editors do not flag the React global in component
  TSX files
- Switch fern instance to global-theme: nvidia with multi-source enabled
- Bump fern CLI to 5.40.0 and drop the basepath-aware experimental flag
- Register fern/components/ as a second mdx-components directory
- Remove the unused Adobe analytics script tag
Signed-off-by: Calum Murray <cmurray@redhat.com>
Signed-off-by: Calum Murray <cmurray@redhat.com>
* feat(helm): add optional PostgreSQL backing store with Secret-based credentials

- Add postgres.enabled and postgres.deploy values to control database
  backend (SQLite vs PostgreSQL) and subchart deployment independently.
- Introduce db-secret.yaml template for Opaque Secret with assembled
  postgresql:// connection string injected via OPENSHELL_DB_URL env var.
- Add Bitnami PostgreSQL as optional subchart dependency keyed on
  postgres.deploy to prevent subchart deployment in external mode.
- Externalize JWT signing key file mode via sandboxJwt.secretDefaultMode
  with 0400 default matching upstream.
- Add validation guard for postgres.deploy=true without postgres.enabled.
- Add helm unit tests covering internal, external, URL-override, special
  character encoding, and misconfiguration error paths.
- Update README with Kubernetes and OpenShift install examples for
  bundled and external PostgreSQL configurations.
- Add helm dependency build to lint and unittest tasks.

* fix(helm): add database backend docs to README.md.gotmpl and regenerate

The helm-docs CI check failed because the Database backend section was
added directly to README.md instead of README.md.gotmpl. Move the
content to the template and regenerate so the check passes.

* fix(helm): use Secret-based DB credentials and support existingSecret

Replace the inline db-url stringData pattern with a proper Secret
containing individual fields plus a uri key.  When postgres.deploy=true
the Bitnami service-binding secret is referenced directly; when
deploy=false users can supply postgres.external.existingSecret to
bring their own Secret, or let the chart generate one from the external
field values.

Also restructures the README database section for clarity, adds
helm-unittest coverage for the new secret resolution paths, and
fixes a markdown lint issue in the root README.

* refactor(helm): move OpenShift e2e script to e2e/rust/ and add mise task

Move test-openshift-scenarios.sh from deploy/helm/openshell/ci/ to
e2e/rust/e2e-openshift.sh, matching the existing e2e script naming
convention. Register it as `e2e:openshift` in tasks/test.toml — not
wired into the `test` or `e2e` aggregates so it only runs on explicit
invocation against a live OpenShift cluster.

* feat(e2e): add database backend scenarios to Kubernetes e2e

Extend with-kube-gateway.sh with an optional multi-scenario loop gated
by OPENSHELL_E2E_KUBE_DB_SCENARIOS=1. When enabled, the script installs
the Helm chart three times — SQLite (default), bundled PostgreSQL, and
external PostgreSQL with existingSecret — running the full test suite
against each backend. When unset, existing single-install behavior is
unchanged.

Also adds helm dependency build before helm install, fixing CI failures
caused by the missing PostgreSQL subchart dependency.

* refactor(helm): simplify PostgreSQL config to two orthogonal controls

Replace postgres.deploy and postgres.external.* with two simple controls:
- postgres.enabled: deploy the bundled Bitnami PostgreSQL subchart
- server.externalDbSecret: name of a pre-existing Secret with a uri key

Delete db-secret.yaml — the chart no longer generates Secrets from
individual credential fields. Users either get the Bitnami service-binding
secret (bundled) or bring their own via server.externalDbSecret.

Add validation that postgres.serviceBindings.enabled must stay true
when using bundled PostgreSQL, preventing a confusing runtime failure.
* feat(build): add simple nix flake with formatter for nix code

* feat(flake): setup rust toolchain, able to build and run unit tests

* feat(flake): add support for arm linux and macos

* feat(toolchain): add rust-src and rust-analyzer to the toolchain
…VIDIA#1565)

* refactor(proto): move phase and current_policy_version into SandboxStatus

Move phase and current_policy_version from SandboxSpec into
SandboxStatus to correctly model mutable runtime state. Update all
callers in the gateway server, TUI, and Python SDK to read and write
these fields through SandboxStatus accessors.

Signed-off-by: Derek Carr <decarr@redhat.com>

* fix(server): preserve sandbox status on statusless driver updates

When a driver update arrives without a status payload (e.g. before
Kubernetes populates the status subresource), preserve the stored
phase, conditions, and current policy version instead of resetting
them. Adds a regression test covering the edge case.

Signed-off-by: Derek Carr <decarr@redhat.com>

---------

Signed-off-by: Derek Carr <decarr@redhat.com>
)

* feat(python-sdk): support OIDC Bearer auth on SandboxClient

PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from
the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on
  `SandboxClient`. Static strings or zero-arg callables (the latter
  is invoked per RPC, so callers can drop in a refresh loop or
  token-file watcher without reconstructing the client). Composes
  with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
  `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
  Appends `authorization: Bearer <token>` to outgoing metadata.
  Implemented as an interceptor (not call credentials) so it works
  on both plaintext (`disableTls=true` dev) and TLS channels without
  `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
  `key_path`) are now optional with `cert_path` / `key_path`
  required-together-or-not-at-all (enforced in `__post_init__`). This
  unlocks three transport profiles from one dataclass:
    * full mTLS (all three)
    * CA-only trust (`ca_path` only)
    * system roots (`TlsConfig()` — for OIDC gateways behind a
      public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
  `build_oidc_channel`:
    * For any `https://` gateway, always build a secure channel.
      Pick the strongest TLS profile available in `mtls/` (full
      mTLS → CA-only → system roots). No more `insecure_channel`
      fallback for HTTPS.
    * Gate OIDC bearer attachment on
      `metadata.json["auth_mode"] == "oidc"`. Matches
      `crates/openshell-cli/src/main.rs:132` and the TUI; a stale
      `oidc_token.json` next to a non-OIDC gateway no longer causes
      the SDK to attach a bearer.
- `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh
  modeled on `google.oauth2.credentials.Credentials` and
  `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every
  RPC; when stale, re-reads disk first (the CLI may have rotated
  the bundle), and only then exchanges the refresh_token against
  the IdP's token endpoint discovered via OIDC discovery
  (`/.well-known/openid-configuration`, cached after first call).
  Concurrent RPCs share a single refresh via `threading.Lock` (no
  IdP stampede). Honors refresh-token rotation. Surfaces IdP
  failures as `SandboxError` with the RFC 6749 error body included
  for diagnostics.

  Mirrors the Rust CLI's HTTP-policy posture from
  `crates/openshell-cli/src/oidc_auth.rs`:
    * `follow_redirects=False` so a 3xx during discovery can't
      steer us to an attacker-controlled token endpoint.
    * Discovery `issuer` is validated against the configured
      issuer; a discovery document claiming a different issuer is
      rejected, preventing the SDK from POSTing the refresh_token
      to a malicious endpoint.
    * `insecure: bool` flag plumbed through to httpx's
      `verify=` so self-signed-cert deployments work the same way
      they do in the Rust CLI.

  Built on `httpx` (chosen over `urllib` specifically for
  follow_redirects + verify control as kwargs). The OAuth2
  refresh-token grant itself (RFC 6749 §6) is one form-encoded
  POST — handled inline rather than via a dedicated OAuth library;
  tried `authlib`'s `OAuth2Client` first but it auto-injects an
  Authorization header on every request, which breaks the
  unauthenticated discovery GET.
- `_make_cluster_bearer_provider(..., auto_refresh=True,
  write_back=True, insecure=False)` factory. Defaults to the
  refresher path with write-back enabled; `auto_refresh=False`
  falls back to the read-only fail-closed behavior for callers that
  don't want the SDK to make outbound HTTP calls to the IdP.

  `write_back=True` is the default (changed from the first round of
  review): IdPs with refresh-token rotation (Keycloak with
  rotation, Entra in strict mode) invalidate the old refresh_token
  on each refresh, so an in-memory-only refresh would leave the
  on-disk bundle pointing at an invalidated value — any second
  process starting from disk would `invalid_grant`. With write-back
  enabled by default, the SDK keeps the shared cache consistent
  with the IdP.
- `from_active_cluster` exposes `auto_refresh`, `write_back`, and
  `insecure` kwargs (defaults: True / True / False). The
  high-level `Sandbox` context manager surfaces the same three
  kwargs and forwards them through, so callers using the wrapper
  have parity with `SandboxClient` for OIDC-protected gateways.
- `SandboxClient.close()` chains to a `_bearer_close` hook so the
  `_OidcRefresher`'s underlying `httpx.Client` is released
  deterministically instead of leaking sockets/FDs until GC runs
  `__del__`. Idempotent.
- `_OidcRefresher._write_to_disk` uses `tempfile.mkstemp` (PID +
  random suffix) instead of a fixed `.oidc_token.json.tmp` path,
  so two writers racing on the same gateway directory don't
  trample each other's tmp content. Success path atomically
  replaces; failure path unlinks the orphan.
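
The temp-file write-back described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual `_write_to_disk`: the function name `write_bundle_atomically`, the temp-file prefix, and the JSON layout are assumptions made for the example.

```python
import json
import os
import tempfile


def write_bundle_atomically(path: str, bundle: dict) -> None:
    """Write `bundle` as JSON to `path` via a unique temp file in the same directory.

    Hypothetical helper for illustration; the real SDK method may differ.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file with mode 0600 and a random suffix, so two
    # writers racing on the same directory never share a temp path.
    fd, tmp_path = tempfile.mkstemp(
        dir=directory, prefix=f".oidc_token.{os.getpid()}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(bundle, handle)
        # os.replace swaps the file in atomically when both paths are on
        # the same filesystem, which is why the temp file lives next to `path`.
        os.replace(tmp_path, path)
    except BaseException:
        # Failure path: remove the orphaned temp file, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Keeping the temp file in the target directory is the design point: a rename across filesystems is not atomic, so a temp file under `/tmp` would not give the same guarantee.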

OAuth2 refresh policy and write-back semantics deliberately mirror
what the major Python SDKs do — see
github.com/googleapis/google-auth-library-python (`Credentials`)
and github.com/boto/botocore (`SSOTokenProvider`):

| Library                        | Native refresh | Writes back   |
|--------------------------------|----------------|---------------|
| google-auth Credentials        | yes            | no            |
| botocore SSOTokenProvider      | yes            | yes           |
| openshell SandboxClient (here) | yes (opt-out)  | yes (opt-out) |

OpenShell sits between the two; chose write-back-by-default because
the rotation invariant matters more for our deployments than the
"CLI is the only writer" assumption that fits google-auth.

Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library —
the refresh grant is a single POST.

Tested:

- 42 sandbox_test.py tests pass (5 pre-existing + 37 new across
  the bearer interceptor, fail-closed provider, refresher
  behavior, TlsConfig validation, from_active_cluster auth ladder,
  security-review regressions, Sandbox-wrapper kwarg forwarding,
  and lifecycle / concurrency probes).
  `mise run test:python` → 47 passed total across the python
  suite.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift:
    * unauthenticated `Health` bypass works
    * admin + `openshell:all` reaches user-callable methods
    * reader (`sandbox:read`) denied on `CreateSandbox` by scope
    * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only
      methods at the router (the new gate is honored from the SDK)
    * full provider CRUD lifecycle via the SDK
    * callable token provider rotates per RPC as expected
- Regression-probed against three pre-review security findings:
    * **Discovery issuer validation** — a discovery document
      claiming a different `issuer` than the configured one is
      rejected with a clear `SandboxError` before any refresh POST
      can reach the attacker-controlled endpoint.
    * **Redirect during discovery** — `follow_redirects=False` on
      the underlying httpx client means a 3xx during discovery
      surfaces as a SandboxError rather than silently chasing the
      redirect.
    * **Cross-process rotation** — a two-process simulation shows
      process B starting from disk and successfully refreshing
      with the rotated refresh_token, because process A's
      write-back updated the shared cache.
- Refresher unit tests cover: cached-fresh fast path, disk-rotated
  re-read before refresh, OAuth2 exchange against the discovered
  token endpoint, refresh-token rotation, atomic write-back at
  0600 mode (default), default-on write_back proven by test,
  concurrent N-thread coordination (one refresh shared across 8
  threads), IdP failure surfaced with the error body, the
  client_credentials / no-refresh_token error path, issuer-
  mismatch rejection, redirect-during-discovery rejection,
  insecure flag plumbing.
- Lifecycle / concurrency regression tests added: `close()`
  invokes the `_bearer_close` hook (idempotent), the refresher's
  `httpx.Client` is marked closed after `SandboxClient.close()`,
  and 16 concurrent writers don't leave orphan tmp files behind
  while producing a valid final bundle. The `Sandbox` wrapper has
  direct forwarding tests proving `auto_refresh`, `write_back`,
  and `insecure` reach `from_active_cluster` (both explicit
  values and defaults).
- End-to-end against a real OpenShift + Keycloak cluster from
  inside a pod: real OIDC discovery against
  `keycloak.keycloak.svc.cluster.local:8080`, refresh-token grant
  POST, atomic write-back of the rotated bundle at 0600, and a
  follow-up RPC reusing the freshly-rotated in-memory token —
  full round-trip in ~170ms.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* fix(python-sdk): adopt newer on-disk OIDC bundle before refreshing

_OidcRefresher.current_access_token() only adopted the on-disk
oidc_token.json when its access token was still fresh; otherwise it
refreshed using the in-memory bundle. With refresh-token rotation
enabled (Keycloak with rotation, Entra strict mode), this let a process
keep using an invalidated refresh_token:

1. Process A holds a stale in-memory bundle with refresh_token=r1.
2. Process B refreshes first and writes a rotated (r2) but now
   near-expiry bundle to disk.
3. Process A re-reads disk, sees the access token is not fresh, ignores
   the disk bundle, and POSTs the stale r1 — which the IdP has already
   invalidated, yielding invalid_grant.

Fix: when the cached bundle is stale, adopt the on-disk bundle if it was
refreshed more recently than ours, even when its access token is also
stale. "More recently" is decided by expires_at — a refresh mints a new
access token with a forward expiry alongside the rotated refresh_token,
so the later expiry carries the newest refresh_token. Comparing by
expiry (rather than unconditionally preferring disk) preserves the
write_back=False case, where the in-memory bundle has already rotated
past the on-disk copy and must not be clobbered. When the adopted
bundle's issuer differs, the cached token endpoint is reset so the
refresh re-discovers against the new issuer.

Adds regression tests for the cross-process rotation race and the
issuer-change re-discovery path.
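
The expiry comparison above can be sketched as a small pure function. This is an illustration under assumptions: the `Bundle` dataclass and `pick_bundle_for_refresh` are hypothetical names, and the real `oidc_token.json` fields may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Bundle:
    # Hypothetical shape of a token bundle, for illustration only.
    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp of access-token expiry
    issuer: str


def pick_bundle_for_refresh(
    cached: Bundle, disk: Optional[Bundle]
) -> Tuple[Bundle, bool]:
    """Return (bundle to refresh with, whether the token endpoint must be re-discovered).

    Called only once the cached access token is already stale.
    """
    # Keep the in-memory bundle unless disk was refreshed more recently.
    # This preserves the write_back=False case, where memory is ahead of disk.
    if disk is None or disk.expires_at <= cached.expires_at:
        return cached, False
    # Disk carries the later expiry, so it holds the newest refresh_token.
    # An issuer change invalidates the cached token endpoint.
    return disk, disk.issuer != cached.issuer
```

For example, with a cached bundle expiring at `100.0` and a disk bundle expiring at `200.0`, the function returns the disk bundle even if both access tokens are stale.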

* fix(python-sdk): recover from invalid_grant on lost rotation race

The expiry-based disk re-read narrows but does not fully close the
cross-process refresh-token rotation race: two processes sharing a
gateway directory can both enter their refresh window, both POST their
copy of the refresh_token, and with rotation enabled the IdP invalidates
the loser's token (invalid_grant). Neither google-auth nor botocore
closes this window without an OS file lock; a Python-only flock would not
coordinate with the Rust CLI/TUI that also write oidc_token.json, so
locking is not worth its cost here.

Recover instead of prevent: distinguish an OAuth2 invalid_grant (the
refresh_token was rejected) from transport/5xx failures via a private
_InvalidGrantError, and on invalid_grant re-read oidc_token.json once. If
a peer wrote a different refresh_token (it won the race), adopt and retry
with it — returning early if it is already fresh — so the loser succeeds
transparently instead of forcing a re-authenticate. If disk offers no new
token, the rejection is genuine and surfaces the re-authenticate hint as
before. The retry is single-shot; a second invalid_grant propagates.

Adds tests for the peer-rotation recovery and the genuine-rejection
(no-retry) paths.
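
The recover-instead-of-prevent flow can be sketched as below. It is a simplified illustration: `InvalidGrantError` stands in for the private `_InvalidGrantError`, the `exchange` and `read_disk_refresh_token` callables are hypothetical seams, and the "return early if the peer's token is already fresh" shortcut is left out.

```python
from typing import Callable, Optional


class InvalidGrantError(Exception):
    """Stand-in for the private _InvalidGrantError named in the commit."""


def refresh_with_recovery(
    refresh_token: str,
    exchange: Callable[[str], str],
    read_disk_refresh_token: Callable[[], Optional[str]],
) -> str:
    """Exchange a refresh token, retrying once if a peer process rotated it.

    `exchange` posts the refresh grant and returns an access token, raising
    InvalidGrantError when the IdP rejects the refresh_token. Other failures
    (transport, 5xx) are assumed to raise different exceptions and propagate.
    """
    try:
        return exchange(refresh_token)
    except InvalidGrantError:
        peer_token = read_disk_refresh_token()
        # No different token on disk: the rejection is genuine.
        if peer_token is None or peer_token == refresh_token:
            raise
        # Single-shot retry: a second InvalidGrantError propagates to the caller.
        return exchange(peer_token)
```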

---------

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
…#1627)

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
…1623)

Rootless Podman sandbox containers reach the host through pasta's local
connection bypass, which translates L2 frames to L4 host sockets. The
dev gateway script binds to 127.0.0.1 by default, which is not routable
through pasta. Auto-detect rootless mode and bind to 0.0.0.0 so sandbox
containers can connect to the gateway.

- Auto-detect rootless Podman in gateway.sh and export
  OPENSHELL_BIND_ADDRESS=0.0.0.0 when not explicitly set
- Add e2e:podman:rootless mise task and CI matrix entry to validate
  rootless Podman networking end-to-end
- CI creates a non-root user inside the privileged container to trigger
  Podman's rootless code paths (pasta, user namespace isolation)

Signed-off-by: Naveen Malik <nmalik@redhat.com>
elezar and others added 30 commits June 16, 2026 10:14
Prefer a single CDI-qualified device when Docker or Podman resolves the default GPU request to one GPU.

Allow nvidia.com/gpu=all only as a WSL2 all-only compatibility fallback, using Docker daemon info and Podman's /dev/dxg probe to identify that case.

Update driver docs, architecture notes, and GPU e2e coverage for the default selection behavior.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* feat(server): support TLS certificate hot-reload

Signed-off-by: Yuedong Wu <dwcn22@outlook.com>

* refactor(server): extract shared TLS test utilities

Signed-off-by: Yuedong Wu <dwcn22@outlook.com>

---------

Signed-off-by: Yuedong Wu <dwcn22@outlook.com>
…IA#1902)

* feat(providers): add DeepInfra as a built-in inference provider (v2 only)

- Adds `deepinfra` as a built-in Providers v2 profile (`providers/deepinfra.yaml`)
  with inference category, Bearer auth, and `DEEPINFRA_API_KEY` discovery
- Adds `DEEPINFRA_PROFILE` to inference routing so `inference.local` works
  with the `deepinfra` provider type
- Fixes `build_backend_url` to strip `/v1` from request paths when the base
  URL contains `/v1/` as an internal segment (e.g. `api.deepinfra.com/v1/openai`),
  preventing double-versioned paths like `.../v1/openai/v1/chat/completions`
- Updates `docs/sandboxes/providers-v2.mdx` and `docs/sandboxes/manage-providers.mdx`
  with DeepInfra entries; removes the old v1 workaround row that used `openai`
  type with `OPENAI_API_KEY`

Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com>

* fix(providers): address gator review findings for DeepInfra provider

- Narrow build_backend_url /v1 dedupe to URLs whose path component is
  exactly /v1 or starts with /v1/ — prevents regression on proxy
  endpoints where /v1 is buried deeper (e.g. /api/v1/openai); add
  regression test for the nested proxy path case
- Add deepinfra provider plugin with DEEPINFRA_API_KEY discovery,
  registered in ProviderRegistry so known_types() and TUI include it
- Add deepinfra to unsupported-inference-provider error message in
  openshell-server for accurate user-facing debugging guidance
- Add deepinfra to openai_compatible_profiles_include_embeddings test
  to lock in the OpenAI-compatible protocol contract

Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com>

* fix(router): handle /v1 as final path segment in build_backend_url dedup

Extends the /v1 deduplication logic to also strip /v1 from request paths
when the base URL's path ends with /v1 (e.g. https://api.groq.com/openai/v1).
The previous fix only matched paths starting with /v1/, which regressed
providers like Groq whose base path has /v1 as the last segment rather than
the first. The nested-proxy exclusion (e.g. /api/v1/openai) is preserved
since /v1 appears in the middle — neither first nor last segment. Adds a
regression test for the Groq-style base URL.
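
The first-or-last-segment rule can be sketched in Python as follows. This is only an illustration of the rule as the commit messages describe it; the real `build_backend_url` is Rust and may differ. The name `join_backend_url` and the `proxy.example.com` host are hypothetical.

```python
from urllib.parse import urlsplit


def join_backend_url(base_url: str, request_path: str) -> str:
    """Join a provider base URL and a request path without doubling /v1.

    Hypothetical helper mirroring the described rule, not the actual router code.
    """
    base = base_url.rstrip("/")
    base_path = urlsplit(base).path
    # The base counts as "already versioned" only when /v1 is its first or
    # last path segment; /v1 buried in the middle is left alone.
    base_is_versioned = (
        base_path == "/v1"
        or base_path.startswith("/v1/")
        or base_path.endswith("/v1")
    )
    request_has_v1 = request_path == "/v1" or request_path.startswith("/v1/")
    if base_is_versioned and request_has_v1:
        request_path = request_path[len("/v1"):]
    return base + request_path


# /v1 as first segment (DeepInfra-style) and last segment (Groq-style) is deduplicated.
print(join_backend_url("https://api.deepinfra.com/v1/openai", "/v1/chat/completions"))
print(join_backend_url("https://api.groq.com/openai/v1", "/v1/chat/completions"))
# /v1 in the middle of the base path (nested proxy) keeps the request's /v1.
print(join_backend_url("https://proxy.example.com/api/v1/openai", "/v1/chat/completions"))
```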

Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com>

* fix(providers): add deepinfra telemetry bucket and update profile list test

- Add DeepInfra variant to ProviderProfile telemetry enum and from_raw()
  mapping so deepinfra providers are tracked in their own bucket rather
  than falling through to Custom
- Map deepinfra in telemetry_provider_profile() in openshell-server
- Add deepinfra to list_provider_profiles_returns_built_in_profile_categories
  test (sorted between cursor and github)
- Update architecture/gateway.md inference provider list to include deepinfra

Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com>

* style(router): apply cargo fmt to backend.rs

Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com>

---------

Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com>
* docs(rfc): require issues before RFCs

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): correct RFC statuses

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): mark template accepted

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* Update rfc/README.md

Co-authored-by: krishicks <khicks@nvidia.com>

* docs(rfc): document accepted RFC project tracking

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): reframe RFC discussion guidance

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): label issues when assigning RFCs

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): link labeled RFC issues to board

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): clarify RFC labels and board tracking

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

---------

Signed-off-by: Drew Newberry <anewberry@nvidia.com>
Co-authored-by: krishicks <khicks@nvidia.com>
…IA#1929)

Signed-off-by: Drew Newberry <anewberry@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This fixes an issue where you may run e.g. `mise run e2e:python`, then after
Python is upgraded in mise.toml, subsequent runs of `e2e:python` fail because
the Python version is out of sync.

Signed-off-by: Kris Hicks <khicks@nvidia.com>
* feat: build CLI during pull request

Fixes NVIDIA#1454

Signed-off-by: Jeff MAURY <jmaury@redhat.com>

* fix: removed secrets passing

Signed-off-by: Jeff MAURY <jmaury@redhat.com>

* fix: fix wrong conflict resolution

Signed-off-by: Jeff MAURY <jmaury@redhat.com>

* fix: apply suggestion from @TaylorMutch

Co-authored-by: Taylor Mutch <taylormutch@gmail.com>

* fix: remove doubled quote

Signed-off-by: Jeff MAURY <jmaury@redhat.com>

* fix: sync mise.lock

Signed-off-by: Jeff MAURY <jmaury@redhat.com>

---------

Signed-off-by: Jeff MAURY <jmaury@redhat.com>
Co-authored-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
* fix(helm): build chart dependencies before lint

Signed-off-by: Evan Lezar <elezar@nvidia.com>

* test(e2e): remove python gpu smoke test

Remove the Python GPU smoke test and its fixture. The e2e:k3s:gpu task only depended on e2e:python:gpu and did not have a separate k3s implementation, so remove that stale alias with the task it pointed at.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
(cherry picked from commit 221a103)

---------

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Five-feature stack for outbound credential management and
package-trust evaluation, used by openlock as its sandbox runtime:

- cred_inject: policy-declared header strip-and-replace for outbound
  HTTP requests. Sandboxes never see real provider tokens; they
  resolve openshell:resolve:env:* sentinels at policy time.
- per-binary credential scoping: allowed_secrets on network rules
  gates which credentials each /usr/bin/<x> can resolve. Prevents
  /usr/bin/gh from seeing ANTHROPIC_API_KEY etc.
- ArcSwap live refresh: lock-free hot-swap of provider credentials
  at runtime via background poll. In-flight requests see snapshot.
- echo endpoint: policy-driven echo: true short-circuits upstream
  connect; returns post-rewrite headers as JSON for offline wire
  proofs.
- trust_check hook: pre-flight package-registry trust evaluation
  via deps.dev (CVSS, license, staleness). Rego rules deny on
  critical CVEs; audit on high; fail-open on lookup error.

This is the source-of-truth squash for openlock's openshell fork
delta. Per-feature design lives in
openlock/docs/superpowers/specs/{2026-05-01..03}-*.md.
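
The trust_check verdict described above (deny on critical, audit on high, fail-open on lookup error) can be sketched in Rust as follows. This is only an illustration: the real rules are written in Rego, the `TrustVerdict` type and `trust_verdict` function are hypothetical names, the 9.0 / 7.0 cut-offs are the standard CVSS v3 critical / high bands rather than values taken from this commit, and the license and staleness signals are left out.

```rust
/// Hypothetical verdict type; the fork expresses this in Rego, not Rust.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustVerdict {
    Allow,
    Audit,
    Deny,
}

/// `lookup` is the result of a registry query: `Ok(Some(score))` carries the
/// highest CVSS score found, `Ok(None)` means no advisories, and `Err` means
/// the lookup itself failed.
pub fn trust_verdict(lookup: Result<Option<f32>, String>) -> TrustVerdict {
    match lookup {
        // Fail-open: a lookup error never blocks the request.
        Err(_) => TrustVerdict::Allow,
        // Assumed CVSS v3 "critical" band.
        Ok(Some(score)) if score >= 9.0 => TrustVerdict::Deny,
        // Assumed CVSS v3 "high" band.
        Ok(Some(score)) if score >= 7.0 => TrustVerdict::Audit,
        // Lower scores or no advisories at all.
        Ok(_) => TrustVerdict::Allow,
    }
}
```

Ordering the arms from most to least severe keeps the fail-open case explicit instead of leaving it to a default branch.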
Builds openshell-gateway, openshell-sandbox, and openshell CLI on
public GitHub-hosted runners (ubuntu-24.04, ubuntu-24.04-arm,
macos-14) and attaches them to a GitHub Release on tag push.

Distinct from upstream's release-tag.yml, which targets NVIDIA's
self-hosted infrastructure. Bundled-z3 used for the CLI to avoid
system z3 dependency on each runner.
Bundled-z3 vendors a z3 source tree whose obj_hashtable.h hits
overload-resolution errors with the clang on macos-14 (Apple clang)
and likely modern Linux clang too. Install libz3-dev / brew z3 and
build the CLI against the system z3 instead.
Avoids the manual 'gh release edit --prerelease' step on every RC.
softprops/action-gh-release reads the prerelease flag from inputs;
gate on tag-name suffix detection. Mirrors openlock f95c3e6.
…ce struct_excessive_bools

After the upstream sync rebase, NetworkPolicyRule's fork-added
allowed_secrets field was missing from 13 test-fixture initializers
in merge.rs. NetworkEndpointDef in lib.rs now has >3 bool fields
(fork's echo + upstream's websocket/body credential rewrite flags);
silence the clippy::struct_excessive_bools lint — these are
independent feature flags, not a state-machine candidate.

Also simplify three raw string literals (remove unnecessary hashes):
the YAML content in allowed_secrets, cred_inject, and trust_check
tests doesn't need the escape-hashes.
After the upstream sync rebase, test code across openshell-sandbox
and openshell-server constructs NetworkPolicyRule, NetworkEndpoint,
L7EvalContext, and L7EndpointConfig without the fork-added fields
(allowed_secrets, cred_inject, echo, trust_check, trust_cache).
Add safe defaults. Also remove dead fetch_provider_environment
helper (unused after ArcSwap poll-loop removal) and unused
arc_swap import; pass None to the trust parameter of
evaluate_l7_request at one more call site.
)

* feat(cli): add --volume flag to sandbox create

Register a repeatable `--volume <HOST>:<CONTAINER>[:ro]` clap arg on
`SandboxCommands::Create`. Adds an `#[allow(clippy::large_enum_variant)]`
to suppress the size-difference lint that fires now that Create has
multiple Vec fields.

The new field is destructured as `_volumes` in the dispatch match arm;
plumbing through the create flow is deferred to a follow-up task.

* refactor(cli): tighten --volume flag test (in-process) + verify clippy allow

- Replace dead openshell_bin() fallback with env!("CARGO_BIN_EXE_openshell")
  so a missing binary fails at compile time rather than panicking at runtime
  with a space-in-path exec error.
- Strengthen the three parse-acceptance tests: assert exit code == 1 (no
  gateway configured) rather than != 2, pinning the exact failure mode and
  ruling out silent success (exit 0) as well as clap errors (exit 2).
- Drop OPENSHELL_ENDPOINT env var from parse tests; the "No active gateway"
  runtime check fires first (exit 1) before any network I/O.
- Add module-level doc comment explaining why subprocess is retained over
  in-process (Cli is private; run::sandbox_create bypasses clap).
- Verify large_enum_variant clippy allow: lint fires at 297 bytes (Create)
  vs 78 bytes (next variant). Update comment to cite actual sizes instead
  of "by design".

* feat(cli): parse --volume HOST:CONTAINER[:ro] specs into typed struct

Introduces volume_spec module with BindVolumeSpec, VolumeParseError, and
parse_volume_spec, validated by 7 unit tests covering both happy paths and
all error variants (bad field count, bad ro token, non-absolute/missing host,
non-absolute container).
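
A minimal sketch of what such a parser might look like. The type and function names come from the commit message, but the field names, error variants, and exact checks are assumptions, and the host-existence check implied by "missing host" is omitted so the example stays free of filesystem access.

```rust
use std::path::PathBuf;

/// Parsed form of a `--volume HOST:CONTAINER[:ro]` argument.
/// Field names are illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BindVolumeSpec {
    pub host: PathBuf,
    pub container: PathBuf,
    pub read_only: bool,
}

/// Illustrative error variants; the real enum may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum VolumeParseError {
    BadFieldCount(usize),
    BadMode(String),
    HostNotAbsolute(String),
    ContainerNotAbsolute(String),
}

pub fn parse_volume_spec(raw: &str) -> Result<BindVolumeSpec, VolumeParseError> {
    let parts: Vec<&str> = raw.split(':').collect();
    // Accept exactly two fields, or three when the third is the literal "ro".
    let (host, container, read_only) = match parts.as_slice() {
        [h, c] => (*h, *c, false),
        [h, c, "ro"] => (*h, *c, true),
        [_, _, other] => return Err(VolumeParseError::BadMode((*other).to_string())),
        _ => return Err(VolumeParseError::BadFieldCount(parts.len())),
    };
    // Both sides must be absolute Unix-style paths.
    if !host.starts_with('/') {
        return Err(VolumeParseError::HostNotAbsolute(host.to_string()));
    }
    if !container.starts_with('/') {
        return Err(VolumeParseError::ContainerNotAbsolute(container.to_string()));
    }
    Ok(BindVolumeSpec {
        host: PathBuf::from(host),
        container: PathBuf::from(container),
        read_only,
    })
}
```

With these assumptions, `parse_volume_spec("/src:/work:ro")` yields a read-only spec, while `"/src:/work:rw"` is rejected as a bad mode token.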

* feat: plumb --volume specs from CLI through proto to DriverSandbox

Add BindVolume message to openshell.proto (SandboxSpec.volumes = 11)
and compute_driver.proto (DriverSandboxSpec.volumes = 11). Wire parsed
BindVolumeSpec entries from the CLI handler through run::sandbox_create
into the CreateSandboxRequest, and map them to DriverBindVolume in
driver_sandbox_spec_from_public on the server side. Fix the docker
driver test fixture and integration test call sites for the updated
sandbox_create signature.

* feat(podman): emit bind mounts + auto userns=keep-id when volumes present

Append a libpod bind-mount entry for each DriverSandboxSpec.volume, and
set userns=keep-id:uid=N,gid=N (from the image USER directive) when any
volumes are present. Falls back to uid/gid 1000660000 for unset or
non-numeric USER directives.

* refactor(podman): tighten image_user parsing + extract COMMUNITY_SANDBOX_UID

Extract the magic number 1_000_660_000 into a named pub constant
COMMUNITY_SANDBOX_UID so all four usage sites share a single source
of truth with a doc-comment linking to the Dockerfile origin.

Fix a split-uid bug in image_user: when Config.User is ":1000", the
uid field is "" which fails to parse as u32. Previously gid was still
extracted from parts[1], producing a mismatched (1_000_660_000, 1000)
pair. Now the function returns the fallback pair immediately on any
uid parse failure, matching the documented contract.

Expand the image_user doc-comment to cover the uid=0/root case
(keep-id:uid=0,gid=0 is harmless on rootless Podman), the empty-user
case, and the non-numeric-user case.

Add an rbind/read-write assertion on bind_mounts[0] in the existing
build_container_spec_emits_bind_mount_entries test.
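
The fallback contract described above could be sketched like this. The constant name and value are from the commit message. The signature of `image_user`, the choice to default gid to uid when the gid part is absent or non-numeric, and the `keep_id_userns` helper are assumptions made for illustration.

```rust
/// Fallback uid/gid used when the image USER directive is unset or non-numeric.
pub const COMMUNITY_SANDBOX_UID: u32 = 1_000_660_000;

/// Turn an image `Config.User` string ("uid", "uid:gid", a name, or empty)
/// into a (uid, gid) pair. Any uid parse failure returns the fallback pair
/// immediately, so a value like ":1000" can no longer produce a mismatched pair.
pub fn image_user(config_user: &str) -> (u32, u32) {
    let fallback = (COMMUNITY_SANDBOX_UID, COMMUNITY_SANDBOX_UID);
    let mut parts = config_user.splitn(2, ':');
    let uid = match parts.next().and_then(|s| s.trim().parse::<u32>().ok()) {
        Some(uid) => uid,
        None => return fallback,
    };
    // Assumption: a missing or non-numeric gid defaults to the uid.
    let gid = parts
        .next()
        .and_then(|s| s.trim().parse::<u32>().ok())
        .unwrap_or(uid);
    (uid, gid)
}

/// Hypothetical helper that formats the userns option described in the
/// podman commit.
pub fn keep_id_userns(uid: u32, gid: u32) -> String {
    format!("keep-id:uid={},gid={}", uid, gid)
}
```

Under this sketch a root image (`"0"`) maps to `keep-id:uid=0,gid=0`, which the commit message notes is harmless on rootless Podman.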

* feat(drivers): docker -v passthrough + vm bind reject for --volume

Extend DockerDriver's build_binds to append user-declared bind volumes
from sandbox.spec.volumes using host:container[:ro] format. Extend
validate_vm_sandbox to reject requests that carry any volumes with
Status::invalid_argument.
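
For illustration, the `host:container[:ro]` bind string might be rendered as below. `docker_bind` is a hypothetical helper, not the driver's actual `build_binds` code.

```rust
/// Render one user-declared bind volume in Docker's `-v` syntax.
/// Hypothetical helper; parameter names are illustrative.
pub fn docker_bind(host: &str, container: &str, read_only: bool) -> String {
    if read_only {
        format!("{}:{}:ro", host, container)
    } else {
        format!("{}:{}", host, container)
    }
}
```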

* docs(volume): e2e tests + podman driver README + user docs for --volume

Add two e2e tests for `openshell sandbox create --volume` (rw round-trip
and ro write-block) gated behind the new `e2e` Cargo feature. Add bind-volume
section to the podman driver README documenting auto userns-remap. Add
`--volume` flag documentation and userns-remap note to the user-facing
manage-sandboxes page and the compute-runtimes architecture doc.
required-ci-gates.yml posts five "Waiting for /ok to test mirror"
pending statuses on every PR (Branch Checks, Helm Lint, Branch E2E
Checks, GPU Test, Branch Kubernetes E2E). These never resolve on this
fork because two pieces of upstream infrastructure are missing:

- No `/ok to test <SHA>` issue-comment handler exists in our
  .github/workflows/. Upstream runs it as an external GitHub App
  that we cannot import.
- No fork-side self-hosted runners are registered, so the worker
  workflows (branch-checks.yml, helm-lint.yml, branch-e2e.yml,
  test-gpu.yml, branch-kubernetes-e2e.yml) cannot execute even when
  manually dispatched: they target `linux-amd64-cpu8` /
  `linux-arm64-cpu8` and a private `ghcr.io/nvidia/openshell/ci`
  container image.

e2e-label-help.yml is the partner workflow that posts a hint comment
when a `test:e2e*` label is applied. Without the `/ok` handler the hint
points at a command that does nothing on this fork.

Delete both. The worker workflows stay in place: they are inert
(unrunnable for the reasons above) and removing them would create a
larger surface to merge back when upstream changes.

Effect: PRs no longer show the perpetually pending checks these two
workflows produced. The fork's real CI signal (CodeQL, openlock-release,
ci-image, etc.) is unaffected.
The driver-layer stop_sandbox already exists for every compute backend
(podman, docker, kubernetes, vm) and the internal compute_driver.proto
exposes a StopSandbox RPC. But the gateway-facing openshell.proto only
surfaces DeleteSandbox, which destroys the workspace volume alongside
the container. Operators who want to halt a sandbox for resource reasons
have to delete it and lose state.

This patch adds:

- StopSandbox and StartSandbox RPCs to proto/openshell.proto with
  matching Request/Response messages.
- Compute-runtime wrappers (stop_sandbox, start_sandbox) that look up
  the sandbox by name, forward stop to the driver, and forward start
  to the StartupResume hook so the existing Docker resume path is
  reused.
- A resume_sandbox method on PodmanComputeDriver mirroring the Docker
  implementation, plus a StartupResume impl wired in new_podman so the
  Start verb works on podman too.
- gRPC handlers (handle_stop_sandbox, handle_start_sandbox), trait
  wires, and matching authz entries (both gated on sandbox:write).
- CLI verbs: openshell sandbox stop [--all] and openshell sandbox start
  with completion for sandbox names and the same forward-cleanup
  behavior as delete.

Phase is left to the existing watch loop to update as containers
transition; no new SandboxPhase variant is required. Restart-from-
stopped works mid-session via Start and at gateway boot via the
existing resume_persisted_sandboxes sweep.

Tests:

- Podman driver: resume_sandbox returns Ok(false) when container is
  missing, is a no-op when already running, and issues POST /start
  when stopped.
- Compute wrappers: NotFound for missing sandboxes, Unimplemented when
  no StartupResume hook is configured, forwards to the hook with the
  correct id/name, surfaces the hook's Ok(false) verdict.
- gRPC handlers: empty-name InvalidArgument and missing-sandbox
  NotFound for both Stop and Start.

The seven mock OpenShell impls in integration tests are updated to
satisfy the expanded trait surface.
…rom crash (#5)

Add a sixth `SandboxPhase` variant so callers can tell an explicit
`openshell sandbox stop` apart from a container that crashed. Previously
both surfaced as `Error`, leaving consumers (the openlock reattach path
in particular) unable to decide whether to resume or surface the
failure.

Implementation: `stop_sandbox` stamps an `openshell.io/stop-requested`
label on the persisted sandbox before invoking the driver; the watch
loop's `apply_sandbox_update_locked` then maps the ensuing
`ContainerExited` Ready=false condition to `Stopped` instead of
`Error` when the label is present. `start_sandbox` clears the label
only after a successful resume so failed resumes don't get misclassified.

The supervisor-session reconciler skips sandboxes already in `Stopped`
to avoid silently reanimating them.

Six unit tests cover the new behavior: stamping, clearing, persistence
when backend is missing, the error→stopped override, and the no-label
case still surfacing as Error.
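
The label-driven classification can be sketched as follows. The label key is from the commit message. The `Phase` enum, the `"true"` label value, and the function names are simplified stand-ins for the real `SandboxPhase` handling in `apply_sandbox_update_locked`.

```rust
use std::collections::HashMap;

pub const STOP_REQUESTED_LABEL: &str = "openshell.io/stop-requested";

/// Simplified stand-in for the relevant SandboxPhase values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Error,
    Stopped,
}

/// Called before the driver stop is issued. The "true" value is an assumption.
pub fn stamp_stop_requested(labels: &mut HashMap<String, String>) {
    labels.insert(STOP_REQUESTED_LABEL.to_string(), "true".to_string());
}

/// Map a ContainerExited (Ready=false) condition to a phase: an exit that
/// follows an explicit stop is Stopped, anything else is still Error.
pub fn classify_exit(labels: &HashMap<String, String>) -> Phase {
    if labels.contains_key(STOP_REQUESTED_LABEL) {
        Phase::Stopped
    } else {
        Phase::Error
    }
}

/// Clear the label only after a successful resume, so a failed resume
/// keeps reporting Stopped rather than Error.
pub fn clear_after_resume(labels: &mut HashMap<String, String>, resumed: bool) {
    if resumed {
        labels.remove(STOP_REQUESTED_LABEL);
    }
}
```

Keeping the marker on the persisted sandbox rather than in memory means the classification can survive a gateway restart.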
The cred-inject fields already use the 9000+ range reserved for openlock
fork additions, but the later bind-mount `volumes` field and the
`SANDBOX_PHASE_STOPPED` enum value grabbed the next sequential numbers
(11 and 6). That put them in upstream's path: upstream's per-sandbox-auth
work (NVIDIA#1404) took DriverSandboxSpec field 11 (`sandbox_token`), colliding
with `volumes`, and upstream could extend SandboxPhase past 5 at any time.

Move both into the fork range so the delta is permanently collision-proof
and the convention is uniform:
  - SandboxSpec.volumes        11 -> 9003  (openshell.proto)
  - DriverSandboxSpec.volumes  12 -> 9003  (compute_driver.proto)
  - SANDBOX_PHASE_STOPPED       6 -> 9004  (openshell.proto)

Safe to renumber: `volumes` is transient provisioning input, and the phase
is re-derived from backend state on every watch (never persisted as the
enum int), so there is no gateway-DB upgrade-path break. Gateway, CLI, and
sandbox binaries always ship as one matched fork tag, so there is no wire skew.
The macOS release CLI dynamically linked Homebrew's libz3.4.15.dylib
(Z3_SYS_Z3_HEADER=/opt/homebrew/include/z3.h), so it failed at startup on
a clean Mac with no brew z3 (dyld: Library not loaded). Mirror the Linux
job instead: build z3 from source and statically link it via
--features bundled-z3, using zig as the C/C++ compiler so the vendored z3
compiles regardless of the runner's Apple clang version.

zig only compiles z3 (built static); the final binary is linked by the
default system linker (ld64) because zig cannot link a macOS executable.
-fno-sanitize=all disables zig cc's default UBSan instrumentation, whose
__ubsan_handle_* symbols are otherwise unresolved at the system-link step.

Verified locally with zig 0.14.1 (the pinned CI version): otool -L on the
resulting binary shows only system libraries (libc++, libSystem, libiconv,
Security/CoreFoundation) -- no z3/Homebrew/opt -- and the binary runs
(openshell --version -> 0.6.3).
Lets cred_inject prepend a literal prefix (e.g. "Bearer ") to the
resolved credential value when composing the injected header, so a raw
stored token can emit `Authorization: Bearer <token>`. Wires the new
proto field through the YAML policy def, OPA serialization, and the
sandbox CredInjectDirective. Empty prefix is a no-op (back-compat).
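
A minimal sketch of the prefix behaviour. `compose_header_value` is a hypothetical helper, not the actual `CredInjectDirective` code.

```rust
/// Build the injected header value from a literal prefix and the resolved
/// credential. An empty prefix leaves the credential unchanged, which
/// preserves the pre-existing behaviour.
pub fn compose_header_value(prefix: &str, credential: &str) -> String {
    let mut value = String::with_capacity(prefix.len() + credential.len());
    value.push_str(prefix);
    value.push_str(credential);
    value
}
```

Because the prefix is literal, the trailing space in `"Bearer "` has to be part of the policy value.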
)

* feat(sandbox): debug-gated L7 egress request/response header logging

* feat(sandbox): file log layer follows configured level (floored at info)

* feat(cli): --log-level on sandbox create sets SandboxSpec.log_level

* fix(sandbox): embed L7 egress header value in message so it renders in shorthand log
Build/clippy/test fixups required by the upstream catch-up rebase
(base 4848c40 -> f23c2c8), on top of the replayed fork-delta commits:

- supervisor-network split (NVIDIA#1650): thread image_sandbox_user through the
  now Result-returning gpu-default container build path; build_container_spec
  back to 3-arg to match volume tests.
- adopt upstream gpu model (NVIDIA#1835): fork gpu_device refs dropped, --volume kept.
- populate fork-added fields in NEW upstream test fixtures: NetworkPolicyRule
  .allowed_secrets and L7EvalContext .{cred_inject,echo,trust_cache,trust_check}.
- arc-swap dev-dependency for SecretResolver tests moved into openshell-core.
- proto refactor (NVIDIA#1565): use Sandbox::phase() accessor + sandbox_create arity
  (volumes/log_level) in CLI integration tests.

cargo build/clippy/test --workspace --features openshell-prover/bundled-z3 green on macOS.

Signed-off-by: Jakub Kovaľ <kuba.koval@gmail.com>
…r a prover/z3 dep)

The upstream catch-up made openshell-server depend on openshell-prover, which
pulls z3-sys. Both release jobs built the gateway with a plain `cargo build
-p openshell-server` (no z3 toolchain, no system libz3), so the gateway build
now fails with "z3.h not found". Mirror the CLI's bundled-z3 + zig setup onto
the gateway build step in both the Linux and macOS jobs so the released
gateway statically links z3 and has no runtime libz3 dependency.

Caught by the post-sync Mac dev-mode smoke (openlock buildFromSource hit the
same z3.h failure); fixed openlock-side too (fork-binaries.ts).

Signed-off-by: Jakub Kovaľ <kuba.koval@gmail.com>