sync: upstream catch-up (133 commits, base 2026-05-26) + re-home fork delta after supervisor split #10
Open
vessux wants to merge 152 commits into
…elm-k3s-local (NVIDIA#1539) macOS ships bash 3.2 which lacks mapfile/readarray. Replace all three occurrences in configure_ghcr_credentials, cluster_has_image, and cluster_image_platform with a portable while-read loop, consistent with the fix applied to docker-build-image.sh in NVIDIA#1334.
Signed-off-by: Ann Marie Fred <afred@redhat.com>
This makes it so you can run the dev gateway and sandbox with:

```
mise run gateway
# in another shell
mise run sandbox
```

Signed-off-by: Kris Hicks <khicks@nvidia.com>
… L4/L7 split (NVIDIA#1412)

* fix(sandbox): add mechanistic smoke test for L4 deny and document the L4/L7 split

  The old smoke script exercised an L7 PUT which hung because the denial aggregator is only wired to L4 CONNECT denies, not L7 enforcement. Add mechanistic-smoke.sh which triggers an L4 deny, waits for the aggregator to flush, and asserts a pending chunk appears under openshell rule get --status pending. Document the intentional L4-only scope of the mechanistic mapper in architecture/sandbox.md.

  Fixes NVIDIA#1333

  Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* refactor(smoke): remove redundant variable inits and merge double step call

  Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* fix(smoke): wire mechanistic smoke into mise and guard TMP_DIR

  - Initialize TMP_DIR before trap to prevent unbound variable on early exit
  - Add e2e:mechanistic-smoke mise task with gateway setup
  - Document mechanistic smoke in policy-advisor README

* test(proxy): verify L4 deny enqueues a DenialEvent

  Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* fix(proxy): remove unnecessary path qualifications in L4 denial smoke test

---------

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
Signed-off-by: Kris Hicks <khicks@nvidia.com>
Signed-off-by: Kris Hicks <khicks@nvidia.com>
NVIDIA#1585) On kernels without Landlock (e.g. gVisor's sentry returns ENOSYS for syscall 444), the previous best_effort path still logged "Applying Landlock" + "Landlock ruleset built" events even though no enforcement was happening. Probe at the top of `landlock::prepare` and short-circuit with a single High-severity "Sandbox Unavailable" finding. Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
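The probe-then-short-circuit flow described above can be sketched in Python. This is only an illustration under stated assumptions: the real change lives in the Rust `landlock::prepare`, and the function names, the finding dictionary shape, and the "Info" severity below are hypothetical. The syscall number 444 comes from the commit text; the probe is injectable so the logic can be exercised without a Landlock-capable kernel.

```python
import ctypes
import errno

SYS_LANDLOCK_CREATE_RULESET = 444      # syscall number quoted in the commit text
LANDLOCK_CREATE_RULESET_VERSION = 1    # flag asking the kernel for the ABI version


def raw_landlock_probe():
    """Linux-only probe: returns (return_code, errno) of the version query."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.syscall(
        ctypes.c_long(SYS_LANDLOCK_CREATE_RULESET),
        ctypes.c_void_p(None),   # no ruleset attributes for a version query
        ctypes.c_size_t(0),
        ctypes.c_uint(LANDLOCK_CREATE_RULESET_VERSION),
    )
    return rc, ctypes.get_errno()


def prepare(probe=raw_landlock_probe):
    """Hypothetical stand-in for landlock::prepare: probe first, then decide."""
    rc, err = probe()
    if rc < 0:
        # Short-circuit: one finding, and no "Applying Landlock" events.
        reason = errno.errorcode.get(err, str(err))
        return [{
            "severity": "High",
            "title": "Sandbox Unavailable",
            "detail": f"Landlock probe failed with {reason}",
        }]
    return [
        {"severity": "Info", "title": "Applying Landlock", "detail": f"ABI {rc}"},
        {"severity": "Info", "title": "Landlock ruleset built", "detail": ""},
    ]
```

Passing a fake probe such as `lambda: (-1, errno.ENOSYS)` reproduces the gVisor case without touching the kernel.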
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Kris Hicks <khicks@nvidia.com>
Documents the ServiceAccount, Role, and ClusterRole created by the Helm chart inline on the setup page, per reviewer feedback on NVIDIA#1250. Reflects the current chart templates including pods/get for sandbox identity and tokenreviews/create for projected token validation. Closes NVIDIA#1018
Signed-off-by: mjamiv <142179942+mjamiv@users.noreply.github.com>
)

* feat(gateway): add readiness probe metrics and test-only store close

  Emit Prometheus readiness metrics for database probes (healthy gauge and outcome-labeled latency histogram) with coverage in health HTTP tests. Restrict Store::close behind test support cfg to prevent accidental runtime pool shutdown under live traffic.

  Signed-off-by: Adrien Langou <alangou@nvidia.com>

* test(e2e): add simple e2e test with kubernetes to test /readyz

  Signed-off-by: Adrien Langou <alangou@nvidia.com>

---------

Signed-off-by: Adrien Langou <alangou@nvidia.com>
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
* fix(cli): preserve symlinks during sandbox upload
* docs(sandboxes): document upload symlink behavior
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
… enforce at the router (NVIDIA#1596) * feat(server): per-handler gRPC auth annotations Move scope, role, and auth-mode metadata to the handler definition site via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and ALLOWED_SANDBOX_METHODS tables are now generated from per-method annotations on the tonic service impls, with canonical gRPC paths derived from the service name and method name. Adds a new openshell-server-macros proc-macro crate, an aggregator in auth/method_authz.rs, and an exhaustiveness test that decodes the protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and verifies every RPC has an annotation. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server): rename `sandbox-secret` auth mode to `sandbox` PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-NVIDIA#1404 identity model. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * fix(server): enforce per-handler AuthMode at the router Addresses review feedback on the per-handler auth-annotation work. - Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous router only checked is_sandbox_callable() for Principal::Sandbox; user principals still flowed into AuthzPolicy::check() and bypassed the per-handler declaration. A user with `openshell:all` could therefore reach `sandbox`-only handlers like GetSandboxProviderEnvironment, ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even though their annotations said sandbox-only. Adds an is_user_callable() predicate and rejects User principals at the router for `sandbox` / `unauthenticated` methods. 
- Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A second `auth`, `scope`, or `role` previously silently overwrote the first value; now it fails to compile. - Regression tests: a unit test for is_user_callable() and a router test that proves a user with admin role + openshell:all cannot reach the nine sandbox-only handlers. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server-macros): drop standalone `rpc_auth` stub The stub was a safety net that fired only when a method had `#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it required `rpc_auth` to be imported, which is why both call sites carried `#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`. Drop the stub and the unused-import workaround. A missing `#[rpc_authz]` now surfaces as rustc's standard "cannot find attribute `rpc_auth` in this scope" — clear enough, and one fewer import + lint exception. Addresses review comment on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server-macros): emit fixed `AUTH_METADATA` const per service The previous trait-derived const name turned `OpenShell` into `OPEN_SHELL_AUTH_METADATA`, splitting the project name across an underscore. Each impl already lives in its own module (`crate::grpc::`, `crate::inference::`), so the module path is enough to disambiguate between services — a fixed `AUTH_METADATA` name reads more naturally. Aggregator in `auth/method_authz.rs` now references `crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA` directly. Addresses review comment on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment OpenShell is one word; reference name in the doc should be OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA. 
Addresses review nit on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> --------- Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
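The router-level gate described in this commit series can be sketched in Python. This is a rough model under assumptions, not the Rust implementation: the `USER` mode name, the `example.OpenShell` gRPC paths, and the `route` function are hypothetical; only the `sandbox` and `unauthenticated` mode names, `is_user_callable()`, `is_sandbox_callable()`, and `AUTH_METADATA` appear in the commit text.

```python
from enum import Enum
from typing import Optional


class AuthMode(Enum):
    USER = "user"                        # hypothetical name for user-facing handlers
    SANDBOX = "sandbox"
    UNAUTHENTICATED = "unauthenticated"


class Principal(Enum):
    USER = "user"
    SANDBOX = "sandbox"


# Stand-in for the generated per-service table; the paths are made up.
AUTH_METADATA = {
    "/example.OpenShell/CreateSandbox": AuthMode.USER,
    "/example.OpenShell/PushSandboxLogs": AuthMode.SANDBOX,
    "/example.OpenShell/Health": AuthMode.UNAUTHENTICATED,
}


def is_user_callable(mode: AuthMode) -> bool:
    # Per the commit text, users are rejected for sandbox and unauthenticated methods.
    return mode is AuthMode.USER


def is_sandbox_callable(mode: AuthMode) -> bool:
    return mode is AuthMode.SANDBOX


def route(path: str, principal: Optional[Principal]) -> bool:
    """Return True when the call may proceed to scope and role checks."""
    mode = AUTH_METADATA.get(path)
    if mode is None:
        return False  # unannotated RPCs are refused in this sketch
    if principal is None:
        return mode is AuthMode.UNAUTHENTICATED
    if principal is Principal.USER:
        return is_user_callable(mode)
    return is_sandbox_callable(mode)
```

The point of the gate is that the decision happens before any scope check, so a user holding `openshell:all` still cannot reach a `sandbox` handler.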
…IA#1615)

- Extract landing-page terminal demo into a reusable <CommandTerminal /> component with inline styles (no global CSS dependency)
- Animate a second command line cycling through claude/opencode/codex via @keyframes scoped inside the component
- Inline BadgeLinks layout styles so the component renders correctly without relying on .badge-links from main.css
- Add jsx.d.ts shim so editors do not flag the React global in component TSX files
- Switch fern instance to global-theme: nvidia with multi-source enabled
- Bump fern CLI to 5.40.0 and drop the basepath-aware experimental flag
- Register fern/components/ as a second mdx-components directory
- Remove the unused Adobe analytics script tag
Signed-off-by: Calum Murray <cmurray@redhat.com>
Signed-off-by: Calum Murray <cmurray@redhat.com>
* feat(helm): add optional PostgreSQL backing store with Secret-based credentials - Add postgres.enabled and postgres.deploy values to control database backend (SQLite vs PostgreSQL) and subchart deployment independently. - Introduce db-secret.yaml template for Opaque Secret with assembled postgresql:// connection string injected via OPENSHELL_DB_URL env var. - Add Bitnami PostgreSQL as optional subchart dependency keyed on postgres.deploy to prevent subchart deployment in external mode. - Externalize JWT signing key file mode via sandboxJwt.secretDefaultMode with 0400 default matching upstream. - Add validation guard for postgres.deploy=true without postgres.enabled. - Add helm unit tests covering internal, external, URL-override, special character encoding, and misconfiguration error paths. - Update README with Kubernetes and OpenShift install examples for bundled and external PostgreSQL configurations. - Add helm dependency build to lint and unittest tasks. * fix(helm): add database backend docs to README.md.gotmpl and regenerate The helm-docs CI check failed because the Database backend section was added directly to README.md instead of README.md.gotmpl. Move the content to the template and regenerate so the check passes. * fix(helm): use Secret-based DB credentials and support existingSecret Replace the inline db-url stringData pattern with a proper Secret containing individual fields plus a uri key. When postgres.deploy=true the Bitnami service-binding secret is referenced directly; when deploy=false users can supply postgres.external.existingSecret to bring their own Secret, or let the chart generate one from the external field values. Also restructures the README database section for clarity, adds helm-unittest coverage for the new secret resolution paths, and fixes a markdown lint issue in the root README. 
* refactor(helm): move OpenShift e2e script to e2e/rust/ and add mise task Move test-openshift-scenarios.sh from deploy/helm/openshell/ci/ to e2e/rust/e2e-openshift.sh, matching the existing e2e script naming convention. Register it as `e2e:openshift` in tasks/test.toml — not wired into the `test` or `e2e` aggregates so it only runs on explicit invocation against a live OpenShift cluster. * feat(e2e): add database backend scenarios to Kubernetes e2e Extend with-kube-gateway.sh with an optional multi-scenario loop gated by OPENSHELL_E2E_KUBE_DB_SCENARIOS=1. When enabled, the script installs the Helm chart three times — SQLite (default), bundled PostgreSQL, and external PostgreSQL with existingSecret — running the full test suite against each backend. When unset, existing single-install behavior is unchanged. Also adds helm dependency build before helm install, fixing CI failures caused by the missing PostgreSQL subchart dependency. * refactor(helm): simplify PostgreSQL config to two orthogonal controls Replace postgres.deploy and postgres.external.* with two simple controls: - postgres.enabled: deploy the bundled Bitnami PostgreSQL subchart - server.externalDbSecret: name of a pre-existing Secret with a uri key Delete db-secret.yaml — the chart no longer generates Secrets from individual credential fields. Users either get the Bitnami service-binding secret (bundled) or bring their own via server.externalDbSecret. Add validation that postgres.serviceBindings.enabled must stay true when using bundled PostgreSQL, preventing a confusing runtime failure.
* feat(build): add simple nix flake with formatter for nix code
* feat(flake): setup rust toolchain, able to build and run unit tests
* feat(flake): add support for arm linux and macos
* feat(toolchain): add rust-src and rust-analyzer to the toolchain
…VIDIA#1565) * refactor(proto): move phase and current_policy_version into SandboxStatus Move phase and current_policy_version from SandboxSpec into SandboxStatus to correctly model mutable runtime state. Update all callers in the gateway server, TUI, and Python SDK to read and write these fields through SandboxStatus accessors. Signed-off-by: Derek Carr <decarr@redhat.com> * fix(server): preserve sandbox status on statusless driver updates When a driver update arrives without a status payload (e.g. before Kubernetes populates the status subresource), preserve the stored phase, conditions, and current policy version instead of resetting them. Adds a regression test covering the edge case. Signed-off-by: Derek Carr <decarr@redhat.com> --------- Signed-off-by: Derek Carr <decarr@redhat.com>
) * feat(python-sdk): support OIDC Bearer auth on SandboxClient PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. 
- `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the RFC 6749 error body included for diagnostics. Mirrors the Rust CLI's HTTP-policy posture from `crates/openshell-cli/src/oidc_auth.rs`: * `follow_redirects=False` so a 3xx during discovery can't steer us to an attacker-controlled token endpoint. * Discovery `issuer` is validated against the configured issuer; a discovery document claiming a different issuer is rejected, preventing the SDK from POSTing the refresh_token to a malicious endpoint. * `insecure: bool` flag plumbed through to httpx's `verify=` so self-signed-cert deployments work the same way they do in the Rust CLI. Built on `httpx` (chosen over `urllib` specifically for follow_redirects + verify control as kwargs). The OAuth2 refresh-token grant itself (RFC 6749 §6) is one form-encoded POST — handled inline rather than via a dedicated OAuth library; tried `authlib`'s `OAuth2Client` first but it auto-injects an Authorization header on every request, which breaks the unauthenticated discovery GET. - `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=True, insecure=False)` factory. Defaults to the refresher path with write-back enabled; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. 
`write_back=True` is the default (changed from the first round of review): IdPs with refresh-token rotation (Keycloak with rotation, Entra in strict mode) invalidate the old refresh_token on each refresh, so an in-memory-only refresh would leave the on-disk bundle pointing at an invalidated value — any second process starting from disk would `invalid_grant`. With write-back enabled by default, the SDK keeps the shared cache consistent with the IdP. - `from_active_cluster` exposes `auto_refresh`, `write_back`, and `insecure` kwargs (defaults: True / True / False). The high-level `Sandbox` context manager surfaces the same three kwargs and forwards them through, so callers using the wrapper have parity with `SandboxClient` for OIDC-protected gateways. - `SandboxClient.close()` chains to a `_bearer_close` hook so the `_OidcRefresher`'s underlying `httpx.Client` is released deterministically instead of leaking sockets/FDs until GC runs `__del__`. Idempotent. - `_OidcRefresher._write_to_disk` uses `tempfile.mkstemp` (PID + random suffix) instead of a fixed `.oidc_token.json.tmp` path, so two writers racing on the same gateway directory don't trample each other's tmp content. Success path atomically replaces; failure path unlinks the orphan. OAuth2 refresh policy and write-back semantics deliberately mirror what the major Python SDKs do — see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`): | Library | Native refresh | Writes back | |-------------------------------|----------------|-------------| | google-auth Credentials | yes | no | | botocore SSOTokenProvider | yes | yes | | openshell SandboxClient (here)| yes (opt-out) | yes (opt-out)| OpenShell sits between the two; chose write-back-by-default because the rotation invariant matters more for our deployments than the "CLI is the only writer" assumption that fits google-auth. Adds `httpx>=0.27` as a runtime dependency. 
No new OAuth2 library — the refresh grant is a single POST. Tested: - 42 sandbox_test.py tests pass (5 pre-existing + 37 new across the bearer interceptor, fail-closed provider, refresher behavior, TlsConfig validation, from_active_cluster auth ladder, security-review regressions, Sandbox-wrapper kwarg forwarding, and lifecycle / concurrency probes). `mise run test:python` → 47 passed total across the python suite. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift: * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against three pre-review security findings: * **Discovery issuer validation** — a discovery document claiming a different `issuer` than the configured one is rejected with a clear `SandboxError` before any refresh POST can reach the attacker-controlled endpoint. * **Redirect during discovery** — `follow_redirects=False` on the underlying httpx client means a 3xx during discovery surfaces as a SandboxError rather than silently chasing the redirect. * **Cross-process rotation** — a two-process simulation shows process B starting from disk and successfully refreshing with the rotated refresh_token, because process A's write-back updated the shared cache. 
- Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode (default), default-on write_back proven by test, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with the error body, the client_credentials / no-refresh_token error path, issuer- mismatch rejection, redirect-during-discovery rejection, insecure flag plumbing. - Lifecycle / concurrency regression tests added: `close()` invokes the `_bearer_close` hook (idempotent), the refresher's `httpx.Client` is marked closed after `SandboxClient.close()`, and 16 concurrent writers don't leave orphan tmp files behind while producing a valid final bundle. The `Sandbox` wrapper has direct forwarding tests proving `auto_refresh`, `write_back`, and `insecure` reach `from_active_cluster` (both explicit values and defaults). - End-to-end against a real OpenShift + Keycloak cluster from inside a pod: real OIDC discovery against `keycloak.keycloak.svc.cluster.local:8080`, refresh-token grant POST, atomic write-back of the rotated bundle at 0600, and a follow-up RPC reusing the freshly-rotated in-memory token — full round-trip in ~170ms. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * fix(python-sdk): adopt newer on-disk OIDC bundle before refreshing _OidcRefresher.current_access_token() only adopted the on-disk oidc_token.json when its access token was still fresh; otherwise it refreshed using the in-memory bundle. With refresh-token rotation enabled (Keycloak with rotation, Entra strict mode), this let a process keep using an invalidated refresh_token: 1. Process A holds a stale in-memory bundle with refresh_token=r1. 2. Process B refreshes first and writes a rotated (r2) but now near-expiry bundle to disk. 3. 
Process A re-reads disk, sees the access token is not fresh, ignores the disk bundle, and POSTs the stale r1 — which the IdP has already invalidated, yielding invalid_grant. Fix: when the cached bundle is stale, adopt the on-disk bundle if it was refreshed more recently than ours, even when its access token is also stale. "More recently" is decided by expires_at — a refresh mints a new access token with a forward expiry alongside the rotated refresh_token, so the later expiry carries the newest refresh_token. Comparing by expiry (rather than unconditionally preferring disk) preserves the write_back=False case, where the in-memory bundle has already rotated past the on-disk copy and must not be clobbered. When the adopted bundle's issuer differs, the cached token endpoint is reset so the refresh re-discovers against the new issuer. Adds regression tests for the cross-process rotation race and the issuer-change re-discovery path. * fix(python-sdk): recover from invalid_grant on lost rotation race The expiry-based disk re-read narrows but does not fully close the cross-process refresh-token rotation race: two processes sharing a gateway directory can both enter their refresh window, both POST their copy of the refresh_token, and with rotation enabled the IdP invalidates the loser's token (invalid_grant). Neither google-auth nor botocore close this window without an OS file lock; a Python-only flock would not coordinate with the Rust CLI/TUI that also write oidc_token.json, so locking is not worth its cost here. Recover instead of prevent: distinguish an OAuth2 invalid_grant (the refresh_token was rejected) from transport/5xx failures via a private _InvalidGrantError, and on invalid_grant re-read oidc_token.json once. If a peer wrote a different refresh_token (it won the race), adopt and retry with it — returning early if it is already fresh — so the loser succeeds transparently instead of forcing a re-authenticate. 
If disk offers no new token, the rejection is genuine and surfaces the re-authenticate hint as before. The retry is single-shot; a second invalid_grant propagates. Adds tests for the peer-rotation recovery and the genuine-rejection (no-retry) paths. --------- Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
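The expiry-based adoption rule from the second fix above can be illustrated with a small sketch. It is not the SDK's `_OidcRefresher` code: `TokenBundle`, `choose_bundle`, and the 30-second freshness skew are hypothetical, chosen only to show why comparing `expires_at` picks the bundle that carries the newest refresh_token.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TokenBundle:
    access_token: str
    refresh_token: str
    expires_at: float   # epoch seconds
    issuer: str


def is_fresh(bundle: TokenBundle, now: float, skew: float = 30.0) -> bool:
    # The skew value is an assumption; the SDK's real window is not stated.
    return bundle.expires_at - skew > now


def choose_bundle(
    cached: TokenBundle, on_disk: Optional[TokenBundle], now: float
) -> Tuple[TokenBundle, bool, bool]:
    """Return (bundle_to_use, needs_refresh, reset_discovery)."""
    if is_fresh(cached, now):
        return cached, False, False            # cached-fresh fast path
    if on_disk is not None and on_disk.expires_at > cached.expires_at:
        # Disk was refreshed more recently, so it holds the newest refresh_token,
        # even if its access token is itself already stale.
        reset = on_disk.issuer != cached.issuer
        return on_disk, not is_fresh(on_disk, now), reset
    # Disk is older (for example when write_back=False): keep the in-memory bundle.
    return cached, True, False
```

In the race from the commit message, process A holds `r1`, finds a later-expiring `r2` bundle on disk, adopts it, and refreshes with `r2` instead of the invalidated `r1`.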
…#1627) Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
…1623) Rootless Podman sandbox containers reach the host through pasta's local connection bypass, which translates L2 frames to L4 host sockets. The dev gateway script binds to 127.0.0.1 by default, which is not routable through pasta. Auto-detect rootless mode and bind to 0.0.0.0 so sandbox containers can connect to the gateway. - Auto-detect rootless Podman in gateway.sh and export OPENSHELL_BIND_ADDRESS=0.0.0.0 when not explicitly set - Add e2e:podman:rootless mise task and CI matrix entry to validate rootless Podman networking end-to-end - CI creates a non-root user inside the privileged container to trigger Podman's rootless code paths (pasta, user namespace isolation) Signed-off-by: Naveen Malik <nmalik@redhat.com>
Prefer a single CDI-qualified device when Docker or Podman resolves the default GPU request to one GPU. Allow nvidia.com/gpu=all only as a WSL2 all-only compatibility fallback, using Docker daemon info and Podman's /dev/dxg probe to identify that case. Update driver docs, architecture notes, and GPU e2e coverage for the default selection behavior. Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* feat(server): support TLS certificate hot-reload

  Signed-off-by: Yuedong Wu <dwcn22@outlook.com>

* refactor(server): extract shared TLS test utilities

  Signed-off-by: Yuedong Wu <dwcn22@outlook.com>

---------

Signed-off-by: Yuedong Wu <dwcn22@outlook.com>
…IA#1902) * feat(providers): add DeepInfra as a built-in inference provider (v2 only) - Adds `deepinfra` as a built-in Providers v2 profile (`providers/deepinfra.yaml`) with inference category, Bearer auth, and `DEEPINFRA_API_KEY` discovery - Adds `DEEPINFRA_PROFILE` to inference routing so `inference.local` works with the `deepinfra` provider type - Fixes `build_backend_url` to strip `/v1` from request paths when the base URL contains `/v1/` as an internal segment (e.g. `api.deepinfra.com/v1/openai`), preventing double-versioned paths like `.../v1/openai/v1/chat/completions` - Updates `docs/sandboxes/providers-v2.mdx` and `docs/sandboxes/manage-providers.mdx` with DeepInfra entries; removes the old v1 workaround row that used `openai` type with `OPENAI_API_KEY` Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com> * fix(providers): address gator review findings for DeepInfra provider - Narrow build_backend_url /v1 dedupe to URLs whose path component is exactly /v1 or starts with /v1/ — prevents regression on proxy endpoints where /v1 is buried deeper (e.g. /api/v1/openai); add regression test for the nested proxy path case - Add deepinfra provider plugin with DEEPINFRA_API_KEY discovery, registered in ProviderRegistry so known_types() and TUI include it - Add deepinfra to unsupported-inference-provider error message in openshell-server for accurate user-facing debugging guidance - Add deepinfra to openai_compatible_profiles_include_embeddings test to lock in the OpenAI-compatible protocol contract Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com> * fix(router): handle /v1 as final path segment in build_backend_url dedup Extends the /v1 deduplication logic to also strip /v1 from request paths when the base URL's path ends with /v1 (e.g. https://api.groq.com/openai/v1). The previous fix only matched paths starting with /v1/, which regressed providers like Groq whose base path has /v1 as the last segment rather than the first. 
The nested-proxy exclusion (e.g. /api/v1/openai) is preserved since /v1 appears in the middle — neither first nor last segment. Adds a regression test for the Groq-style base URL. Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com> * fix(providers): add deepinfra telemetry bucket and update profile list test - Add DeepInfra variant to ProviderProfile telemetry enum and from_raw() mapping so deepinfra providers are tracked in their own bucket rather than falling through to Custom - Map deepinfra in telemetry_provider_profile() in openshell-server - Add deepinfra to list_provider_profiles_returns_built_in_profile_categories test (sorted between cursor and github) - Update architecture/gateway.md inference provider list to include deepinfra Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com> * style(router): apply cargo fmt to backend.rs Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com> --------- Signed-off-by: Milos Milutinovic <codemastermilos@gmail.com>
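The `/v1` dedupe rules that the three router fixes converge on can be summarized in a short Python sketch. It is an approximation written from the commit text, not a port of the Rust `build_backend_url`; the helper name `should_dedupe_v1` and the proxy hostname in the usage note are hypothetical.

```python
from urllib.parse import urlsplit


def should_dedupe_v1(base_url: str) -> bool:
    """True when the base path is exactly /v1, starts with /v1/, or ends with /v1."""
    path = urlsplit(base_url).path.rstrip("/")
    return path == "/v1" or path.startswith("/v1/") or path.endswith("/v1")


def build_backend_url(base_url: str, request_path: str) -> str:
    """Join base URL and request path without producing a doubled /v1 segment."""
    if should_dedupe_v1(base_url):
        if request_path == "/v1":
            request_path = ""
        elif request_path.startswith("/v1/"):
            request_path = request_path[len("/v1"):]
    return base_url.rstrip("/") + request_path
```

With these rules, `https://api.deepinfra.com/v1/openai` and `https://api.groq.com/openai/v1` both drop the request's leading `/v1`, while a nested proxy base such as `https://proxy.example.com/api/v1/openai` keeps it because `/v1` is neither the first nor the last path segment.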
* docs(rfc): require issues before RFCs

  Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): correct RFC statuses

  Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): mark template accepted

  Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* Update rfc/README.md

  Co-authored-by: krishicks <khicks@nvidia.com>

* docs(rfc): document accepted RFC project tracking

  Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): reframe RFC discussion guidance

  Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): label issues when assigning RFCs

  Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): link labeled RFC issues to board

  Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* docs(rfc): clarify RFC labels and board tracking

  Signed-off-by: Drew Newberry <anewberry@nvidia.com>

---------

Signed-off-by: Drew Newberry <anewberry@nvidia.com>
Co-authored-by: krishicks <khicks@nvidia.com>
…IA#1929) Signed-off-by: Drew Newberry <anewberry@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This fixes an issue where you may run e.g. `mise run e2e:python`, then after Python is upgraded in mise.toml, subsequent runs of `e2e:python` fail because the Python version is out of sync. Signed-off-by: Kris Hicks <khicks@nvidia.com>
* feat: build CLI during pull request

  Fixes NVIDIA#1454

  Signed-off-by: Jeff MAURY <jmaury@redhat.com>

* fix: removed secrets passing

  Signed-off-by: Jeff MAURY <jmaury@redhat.com>

* fix: fix wrong conflict resolution

  Signed-off-by: Jeff MAURY <jmaury@redhat.com>

* fix: apply suggestion from @TaylorMutch

  Co-authored-by: Taylor Mutch <taylormutch@gmail.com>

* fix: remove doubled quote

  Signed-off-by: Jeff MAURY <jmaury@redhat.com>

* fix: sync mise.lock

  Signed-off-by: Jeff MAURY <jmaury@redhat.com>

---------

Signed-off-by: Jeff MAURY <jmaury@redhat.com>
Co-authored-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
* fix(helm): build chart dependencies before lint

  Signed-off-by: Evan Lezar <elezar@nvidia.com>

* test(e2e): remove python gpu smoke test

  Remove the Python GPU smoke test and its fixture. The e2e:k3s:gpu task only depended on e2e:python:gpu and did not have a separate k3s implementation, so remove that stale alias with the task it pointed at.

  Signed-off-by: Evan Lezar <elezar@nvidia.com>
  (cherry picked from commit 221a103)

---------

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Five-feature stack for outbound credential management and
package-trust evaluation, used by openlock as its sandbox runtime:
- cred_inject: policy-declared header strip-and-replace for outbound
HTTP requests. Sandboxes never see real provider tokens; they
resolve openshell:resolve:env:* sentinels at policy time.
- per-binary credential scoping: allowed_secrets on network rules
gates which credentials each /usr/bin/<x> can resolve. Prevents
/usr/bin/gh from seeing ANTHROPIC_API_KEY etc.
- ArcSwap live refresh: lock-free hot-swap of provider credentials
at runtime via background poll. In-flight requests see snapshot.
- echo endpoint: policy-driven echo: true short-circuits upstream
connect; returns post-rewrite headers as JSON for offline wire
proofs.
- trust_check hook: pre-flight package-registry trust evaluation
via deps.dev (CVSS, license, staleness). Rego rules deny on
critical CVEs; audit on high; fail-open on lookup error.
This is the source-of-truth squash for openlock's openshell fork
delta. Per-feature design lives in
openlock/docs/superpowers/specs/{2026-05-01..03}-*.md.
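The trust_check outcome described above (deny on critical CVEs, audit on high, fail-open on lookup error) can be sketched as a small decision function. This is a minimal illustration, not the fork's code: the commit says the real rules are written in Rego, and the names `Severity`, `TrustVerdict`, and `evaluate_trust` are hypothetical.

```rust
/// Highest-severity advisory found for a package (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Low,
    Medium,
    High,
    Critical,
}

/// Outcome of the pre-flight trust evaluation (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum TrustVerdict {
    Allow,
    Audit,
    Deny,
}

/// `lookup` is `Err` when the registry lookup failed and `Ok(None)` when
/// the lookup succeeded but reported no advisories.
pub fn evaluate_trust(lookup: Result<Option<Severity>, String>) -> TrustVerdict {
    match lookup {
        // Fail-open: a failed lookup never blocks the request.
        Err(_) => TrustVerdict::Allow,
        // Critical CVEs are denied outright.
        Ok(Some(Severity::Critical)) => TrustVerdict::Deny,
        // High severity is allowed through but recorded for audit.
        Ok(Some(Severity::High)) => TrustVerdict::Audit,
        // Anything lower, or no advisories at all, is allowed.
        Ok(_) => TrustVerdict::Allow,
    }
}
```

Putting the error arm first makes the fail-open choice explicit: an unreachable lookup service degrades to "allow" rather than blocking every package install.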
Builds openshell-gateway, openshell-sandbox, and openshell CLI on public GitHub-hosted runners (ubuntu-24.04, ubuntu-24.04-arm, macos-14) and attaches them to a GitHub Release on tag push. Distinct from upstream's release-tag.yml, which targets NVIDIA's self-hosted infrastructure. Bundled-z3 used for the CLI to avoid system z3 dependency on each runner.
Bundled-z3 vendors a z3 source tree whose obj_hashtable.h hits overload-resolution errors with the clang on macos-14 (Apple clang) and likely modern Linux clang too. Install libz3-dev / brew z3 and build the CLI against the system z3 instead.
Avoids the manual 'gh release edit --prerelease' step on every RC. softprops/action-gh-release reads the prerelease flag from inputs; gate on tag-name suffix detection. Mirrors openlock f95c3e6.
…ce struct_excessive_bools After the upstream sync rebase, NetworkPolicyRule's fork-added allowed_secrets field was missing from 13 test-fixture initializers in merge.rs. NetworkEndpointDef in lib.rs now has >3 bool fields (fork's echo + upstream's websocket/body credential rewrite flags); silence the clippy::struct_excessive_bools lint — these are independent feature flags, not a state-machine candidate. Also simplify three raw string literals (remove unnecessary hashes): the YAML content in allowed_secrets, cred_inject, and trust_check tests doesn't need the escape-hashes.
After the upstream sync rebase, test code across openshell-sandbox and openshell-server constructs NetworkPolicyRule, NetworkEndpoint, L7EvalContext, and L7EndpointConfig without the fork-added fields (allowed_secrets, cred_inject, echo, trust_check, trust_cache). Add safe defaults. Also remove dead fetch_provider_environment helper (unused after ArcSwap poll-loop removal) and unused arc_swap import; pass None to the trust parameter of evaluate_l7_request at one more call site.
)
* feat(cli): add --volume flag to sandbox create
Register a repeatable `--volume <HOST>:<CONTAINER>[:ro]` clap arg on `SandboxCommands::Create`. Adds an `#[allow(clippy::large_enum_variant)]` to suppress the size-difference lint that fires now that Create has multiple Vec fields. The new field is destructured as `_volumes` in the dispatch match arm; plumbing through the create flow is deferred to a follow-up task.
* refactor(cli): tighten --volume flag test (in-process) + verify clippy allow
- Replace dead openshell_bin() fallback with env!("CARGO_BIN_EXE_openshell") so a missing binary fails at compile time rather than panicking at runtime with a space-in-path exec error.
- Strengthen the three parse-acceptance tests: assert exit code == 1 (no gateway configured) rather than != 2, pinning the exact failure mode and ruling out silent success (exit 0) as well as clap errors (exit 2).
- Drop OPENSHELL_ENDPOINT env var from parse tests; the "No active gateway" runtime check fires first (exit 1) before any network I/O.
- Add module-level doc comment explaining why subprocess is retained over in-process (Cli is private; run::sandbox_create bypasses clap).
- Verify large_enum_variant clippy allow: lint fires at 297 bytes (Create) vs 78 bytes (next variant). Update comment to cite actual sizes instead of "by design".
* feat(cli): parse --volume HOST:CONTAINER[:ro] specs into typed struct
Introduces volume_spec module with BindVolumeSpec, VolumeParseError, and parse_volume_spec, validated by 7 unit tests covering both happy paths and all error variants (bad field count, bad ro token, non-absolute/missing host, non-absolute container).
* feat: plumb --volume specs from CLI through proto to DriverSandbox
Add BindVolume message to openshell.proto (SandboxSpec.volumes = 11) and compute_driver.proto (DriverSandboxSpec.volumes = 11). Wire parsed BindVolumeSpec entries from the CLI handler through run::sandbox_create into the CreateSandboxRequest, and map them to DriverBindVolume in driver_sandbox_spec_from_public on the server side. Fix the docker driver test fixture and integration test call sites for the updated sandbox_create signature.
* feat(podman): emit bind mounts + auto userns=keep-id when volumes present
Append a libpod bind-mount entry for each DriverSandboxSpec.volume, and set userns=keep-id:uid=N,gid=N (from the image USER directive) when any volumes are present. Falls back to uid/gid 1000660000 for unset or non-numeric USER directives.
* refactor(podman): tighten image_user parsing + extract COMMUNITY_SANDBOX_UID
Extract the magic number 1_000_660_000 into a named pub constant COMMUNITY_SANDBOX_UID so all four usage sites share a single source of truth with a doc-comment linking to the Dockerfile origin. Fix a split-uid bug in image_user: when Config.User is ":1000", the uid field is "" which fails to parse as u32. Previously gid was still extracted from parts[1], producing a mismatched (1_000_660_000, 1000) pair. Now the function returns the fallback pair immediately on any uid parse failure, matching the documented contract. Expand the image_user doc-comment to cover the uid=0/root case (keep-id:uid=0,gid=0 is harmless on rootless Podman), the empty-user case, and the non-numeric-user case. Add an rbind/read-write assertion on bind_mounts[0] in the existing build_container_spec_emits_bind_mount_entries test.
* feat(drivers): docker -v passthrough + vm bind reject for --volume
Extend DockerDriver's build_binds to append user-declared bind volumes from sandbox.spec.volumes using host:container[:ro] format. Extend validate_vm_sandbox to reject requests that carry any volumes with Status::invalid_argument.
* docs(volume): e2e tests + podman driver README + user docs for --volume
Add two e2e tests for `openshell sandbox create --volume` (rw round-trip and ro write-block) gated behind the new `e2e` Cargo feature. Add bind-volume section to the podman driver README documenting auto userns-remap. Add `--volume` flag documentation and userns-remap note to the user-facing manage-sandboxes page and the compute-runtimes architecture doc.
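The `HOST:CONTAINER[:ro]` parsing that these commits describe can be illustrated roughly as follows. This is a sketch under assumptions: the commits name `BindVolumeSpec`, `VolumeParseError`, and `parse_volume_spec`, but the field names and error variants below are guesses, and the host-existence check the commits mention is left out so the example stays independent of the filesystem.

```rust
/// Parsed bind-volume request (field names are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BindVolumeSpec {
    pub host: String,
    pub container: String,
    pub read_only: bool,
}

/// Parse failures (variant names are assumptions).
#[derive(Debug, PartialEq, Eq)]
pub enum VolumeParseError {
    BadFieldCount(usize),
    BadModeToken(String),
    HostNotAbsolute(String),
    ContainerNotAbsolute(String),
}

/// Parse `HOST:CONTAINER[:ro]` into a typed spec.
pub fn parse_volume_spec(spec: &str) -> Result<BindVolumeSpec, VolumeParseError> {
    let parts: Vec<&str> = spec.split(':').collect();
    let (host, container, read_only) = match parts.as_slice() {
        [host, container] => (*host, *container, false),
        [host, container, "ro"] => (*host, *container, true),
        // Three fields, but the third is something other than "ro".
        [_, _, other] => return Err(VolumeParseError::BadModeToken((*other).to_string())),
        _ => return Err(VolumeParseError::BadFieldCount(parts.len())),
    };
    // Both sides must be absolute Unix-style paths.
    if !host.starts_with('/') {
        return Err(VolumeParseError::HostNotAbsolute(host.to_string()));
    }
    if !container.starts_with('/') {
        return Err(VolumeParseError::ContainerNotAbsolute(container.to_string()));
    }
    Ok(BindVolumeSpec {
        host: host.to_string(),
        container: container.to_string(),
        read_only,
    })
}
```

Matching on the split slice keeps each rejection reason in its own arm, which lines up with the one-test-per-error-variant structure the commit describes.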
required-ci-gates.yml posts five "Waiting for /ok to test mirror" pending statuses on every PR (Branch Checks, Helm Lint, Branch E2E Checks, GPU Test, Branch Kubernetes E2E). These never resolve on this fork because two pieces of upstream infrastructure are missing:
- No `/ok to test <SHA>` issue-comment handler exists in our .github/workflows/. Upstream runs it as an external GitHub App that we cannot import.
- No fork-side self-hosted runners are registered, so the worker workflows (branch-checks.yml, helm-lint.yml, branch-e2e.yml, test-gpu.yml, branch-kubernetes-e2e.yml) cannot execute even when manually dispatched: they target `linux-amd64-cpu8` / `linux-arm64-cpu8` and a private `ghcr.io/nvidia/openshell/ci` container image.
e2e-label-help.yml is the partner workflow that posts a hint comment when a `test:e2e*` label is applied. Without the `/ok` handler the hint points at a command that does nothing on this fork.
Delete both. The worker workflows stay in place: they are inert (unrunnable for the reasons above) and removing them would create a larger surface to merge back when upstream changes.
Effect: PRs no longer show two perpetually-yellow checks. The fork's real CI signal (CodeQL, openlock-release, ci-image, etc.) is unaffected.
The driver-layer stop_sandbox already exists for every compute backend (podman, docker, kubernetes, vm) and the internal compute_driver.proto exposes a StopSandbox RPC. But the gateway-facing openshell.proto only surfaces DeleteSandbox, which destroys the workspace volume alongside the container. Operators who want to halt a sandbox for resource reasons have to delete it and lose state.
This patch adds:
- StopSandbox and StartSandbox RPCs to proto/openshell.proto with matching Request/Response messages.
- Compute-runtime wrappers (stop_sandbox, start_sandbox) that look up the sandbox by name, forward stop to the driver, and forward start to the StartupResume hook so the existing Docker resume path is reused.
- A resume_sandbox method on PodmanComputeDriver mirroring the Docker implementation, plus a StartupResume impl wired in new_podman so the Start verb works on podman too.
- gRPC handlers (handle_stop_sandbox, handle_start_sandbox), trait wires, and matching authz entries (both gated on sandbox:write).
- CLI verbs: openshell sandbox stop [--all] and openshell sandbox start with completion for sandbox names and the same forward-cleanup behavior as delete.
Phase is left to the existing watch loop to update as containers transition; no new SandboxPhase variant is required. Restart-from-stopped works mid-session via Start and at gateway boot via the existing resume_persisted_sandboxes sweep.
Tests:
- Podman driver: resume_sandbox returns Ok(false) when container is missing, is a no-op when already running, and issues POST /start when stopped.
- Compute wrappers: NotFound for missing sandboxes, Unimplemented when no StartupResume hook is configured, forwards to the hook with the correct id/name, surfaces the hook's Ok(false) verdict.
- gRPC handlers: empty-name InvalidArgument and missing-sandbox NotFound for both Stop and Start.
The seven mock OpenShell impls in integration tests are updated to satisfy the expanded trait surface.
…rom crash (#5) Add a sixth `SandboxPhase` variant so callers can tell an explicit `openshell sandbox stop` apart from a container that crashed. Previously both surfaced as `Error`, leaving consumers (the openlock reattach path in particular) unable to decide whether to resume or surface the failure. Implementation: `stop_sandbox` stamps an `openshell.io/stop-requested` label on the persisted sandbox before invoking the driver; the watch loop's `apply_sandbox_update_locked` then maps the ensuing `ContainerExited` Ready=false condition to `Stopped` instead of `Error` when the label is present. `start_sandbox` clears the label only after a successful resume so failed resumes don't get misclassified. The supervisor-session reconciler skips sandboxes already in `Stopped` to avoid silently reanimating them. Six unit tests cover the new behavior: stamping, clearing, persistence when backend is missing, the error→stopped override, and the no-label case still surfacing as Error.
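The label-based classification in this commit can be sketched as below. Only the label key `openshell.io/stop-requested` comes from the commit message; the label value, the two-variant `Phase` enum, and the function names are hypothetical stand-ins for the real `SandboxPhase` and watch-loop code.

```rust
use std::collections::HashMap;

/// Label stamped on the persisted sandbox before the driver stop is invoked.
pub const STOP_REQUESTED_LABEL: &str = "openshell.io/stop-requested";

/// Reduced stand-in for the phases relevant to an exited container.
#[derive(Debug, PartialEq, Eq)]
pub enum Phase {
    Stopped,
    Error,
}

/// Stop path: record the operator's intent first (the value "true" is an assumption).
pub fn request_stop(labels: &mut HashMap<String, String>) {
    labels.insert(STOP_REQUESTED_LABEL.to_string(), "true".to_string());
}

/// Watch loop: classify a `ContainerExited` (Ready=false) condition.
pub fn phase_for_exit(labels: &HashMap<String, String>) -> Phase {
    if labels.contains_key(STOP_REQUESTED_LABEL) {
        Phase::Stopped
    } else {
        Phase::Error
    }
}

/// Start path: clear the label only after a successful resume, so a
/// failed resume is still reported as an explicit stop.
pub fn finish_start(labels: &mut HashMap<String, String>, resumed: bool) {
    if resumed {
        labels.remove(STOP_REQUESTED_LABEL);
    }
}
```

Keeping the intent in a persisted label rather than in memory lets the classification survive a gateway restart between the stop request and the container's exit event.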
The cred-inject fields already use the 9000+ range reserved for openlock fork additions, but the later bind-mount `volumes` field and the `SANDBOX_PHASE_STOPPED` enum value grabbed the next sequential numbers (11 and 6). That put them in upstream's path: upstream's per-sandbox-auth work (NVIDIA#1404) took DriverSandboxSpec field 11 (`sandbox_token`), colliding with `volumes`, and upstream could extend SandboxPhase past 5 at any time.
Move both into the fork range so the delta is permanently collision-proof and the convention is uniform:
- SandboxSpec.volumes 11 -> 9003 (openshell.proto)
- DriverSandboxSpec.volumes 12 -> 9003 (compute_driver.proto)
- SANDBOX_PHASE_STOPPED 6 -> 9004 (openshell.proto)
Safe to renumber: `volumes` is transient provisioning input and the phase is re-derived from backend state on every watch (never persisted as the enum int), so no gateway-DB upgrade-path break. Gateway, CLI, and sandbox binaries always ship as one matched fork tag, so there is no wire skew.
…OKEN (fix macOS 403)
The macOS release CLI dynamically linked Homebrew's libz3.4.15.dylib (Z3_SYS_Z3_HEADER=/opt/homebrew/include/z3.h), so it failed at startup on a clean Mac with no brew z3 (dyld: Library not loaded). Mirror the Linux job instead: build z3 from source and statically link it via --features bundled-z3, using zig as the C/C++ compiler so the vendored z3 compiles regardless of the runner's Apple clang version. zig only compiles z3 (built static); the final binary is linked by the default system linker (ld64) because zig cannot link a macOS executable. -fno-sanitize=all disables zig cc's default UBSan instrumentation, whose __ubsan_handle_* symbols are otherwise unresolved at the system-link step. Verified locally with zig 0.14.1 (the pinned CI version): otool -L on the resulting binary shows only system libraries (libc++, libSystem, libiconv, Security/CoreFoundation) -- no z3/Homebrew/opt -- and the binary runs (openshell --version -> 0.6.3).
Lets cred_inject prepend a literal prefix (e.g. "Bearer ") to the resolved credential value when composing the injected header, so a raw stored token can emit `Authorization: Bearer <token>`. Wires the new proto field through the YAML policy def, OPA serialization, and the sandbox CredInjectDirective. Empty prefix is a no-op (back-compat).
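A rough sketch of how the prefix might combine with the strip-and-replace behaviour of cred_inject. It assumes headers are held as name/value pairs; `InjectDirective` and `inject_credential` are hypothetical names, not the fork's `CredInjectDirective` API.

```rust
/// Hypothetical, simplified view of a cred_inject directive.
pub struct InjectDirective {
    /// Header to strip from the outbound request and re-add.
    pub header: String,
    /// Literal text placed before the resolved credential, e.g. "Bearer ".
    pub prefix: String,
}

/// Remove any existing copy of the header (names compare case-insensitively),
/// then append the injected value composed as prefix + secret.
pub fn inject_credential(
    headers: &mut Vec<(String, String)>,
    directive: &InjectDirective,
    secret: &str,
) {
    headers.retain(|(name, _)| !name.eq_ignore_ascii_case(&directive.header));
    // An empty prefix leaves the secret unchanged, which is the back-compat case.
    let value = format!("{}{}", directive.prefix, secret);
    headers.push((directive.header.clone(), value));
}
```

With `prefix` set to `"Bearer "` and a stored raw token `tok123`, the outbound request carries `Authorization: Bearer tok123`, while whatever value the sandbox originally sent for that header is discarded.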
)
* feat(sandbox): debug-gated L7 egress request/response header logging
* feat(sandbox): file log layer follows configured level (floored at info)
* feat(cli): --log-level on sandbox create sets SandboxSpec.log_level
* fix(sandbox): embed L7 egress header value in message so it renders in shorthand log
Build/clippy/test fixups required by the upstream catch-up rebase (base 4848c40 -> f23c2c8), on top of the replayed fork-delta commits:
- supervisor-network split (NVIDIA#1650): thread image_sandbox_user through the now Result-returning gpu-default container build path; build_container_spec back to 3-arg to match volume tests.
- adopt upstream gpu model (NVIDIA#1835): fork gpu_device refs dropped, --volume kept.
- populate fork-added fields in NEW upstream test fixtures: NetworkPolicyRule .allowed_secrets and L7EvalContext .{cred_inject,echo,trust_cache,trust_check}.
- arc-swap dev-dependency for SecretResolver tests moved into openshell-core.
- proto refactor (NVIDIA#1565): use Sandbox::phase() accessor + sandbox_create arity (volumes/log_level) in CLI integration tests.
cargo build/clippy/test --workspace --features openshell-prover/bundled-z3 green on macOS.
Signed-off-by: Jakub Kovaľ <kuba.koval@gmail.com>
…r a prover/z3 dep) The upstream catch-up made openshell-server depend on openshell-prover, which pulls z3-sys. Both release jobs built the gateway with a plain `cargo build -p openshell-server` (no z3 toolchain, no system libz3), so the gateway build now fails with "z3.h not found". Mirror the CLI's bundled-z3 + zig setup onto the gateway build step in both the Linux and macOS jobs so the released gateway statically links z3 and has no runtime libz3 dependency. Caught by the post-sync Mac dev-mode smoke (openlock buildFromSource hit the same z3.h failure); fixed openlock-side too (fork-binaries.ts). Signed-off-by: Jakub Kovaľ <kuba.koval@gmail.com>
Summary
Upstream-sync rebase: fork delta re-parented onto `upstream/main` (`4848c409` → `f23c2c8e`, 133 commits absorbed). Per the fork-sync flow this PR is a review artifact, not a merge: it will show `CONFLICTING` because the rebased branch cannot fast-forward into `main`. Land via the force-push below.
17 fork-delta commits replayed + 1 adaptation commit. The stale `style: cargo fmt` commit was dropped (re-formatted in the adaptation pass).
Verification (macOS)
- `cargo build/clippy/test --workspace --features openshell-prover/bundled-z3`: green, clippy zero warnings, all tests pass (722/723 + ignored gateway/podman integration tests).
- Fork proto numbers: cred_inject=9000, echo=9001, trust_check=9002, volumes=9003, SANDBOX_PHASE_STOPPED=9004; zero upstream collisions in the 8000/9000 range.
Conflict surfaces resolved
- NVIDIA/OpenShell#1650 (split `sandbox` into `process` and `network` subcrates), openshell-sandbox → supervisor process/network split (biggest): cred-inject + L7 egress + trust stack re-homed into `openshell-supervisor-network` (git ORT auto-followed most renames). `trust.rs` (fork-new) relocated to the network crate; `secrets.rs` tracked to `openshell-core`; `trust_cache` wiring moved into `network/run.rs`. `L7EvalContext` field-set merged (upstream `activity_tx`/`dynamic_credentials`/`token_grant_resolver` + fork `cred_inject`/`echo`/`trust_cache`/`trust_check`).
- Upstream `gpu` bool model: dropped fork's now-removed `gpu_device` refs (vm-driver validation reduced to the `--volume` rejection; template checks already exist upstream as `validate_vm_template_request`).
- `Sandbox::phase()` accessor used where the field moved into status.
- `--volume` × upstream container refactor: threaded `image_sandbox_user` through the now `Result`-returning `build_container_spec_with_token_and_gpu_default`.
- `#[rpc_auth(bearer, sandbox:write, user)]`; obsolete inline authz consts dropped (moved to `method_authz`).
- `e2e-label-help.yml` (modify/delete).
Not landing via this PR
Force-push to `main` is the human step (default-branch protection).
Post-merge: tag `vX.Y.0` → release workflow builds binaries → bump `OPENSHELL_FORK_TAG` in openlock `src/sandbox/fork-binaries.ts`; smoke-test a fresh gateway on Mac/podman.